DOCUMENT RESUME

ED 183 17B

TITLE           Report of the Conference on Development of
                User-Oriented Software (Alexandria, Virginia,
                November 8-10, 1977).
INSTITUTION     American Statistical Association, Washington, D.C.
SPONS AGENCY    National Science Foundation, Washington, ...
                Social Sciences.
PUB DATE        Nov 77
GRANT           isE-76-15271
NOTE            257p.; Figure 2, page 235, ...

EDRS PRICE      MF01/PC11 Plus Postage.
DESCRIPTORS     *Census Figures; Data Bases; ... Processing;
                Disclosure; ... *Information Systems; *Sta...
                Studies

ABSTRACT

One of four projects conducted by the American Statistical Association
(ASA) in cooperation with the Bureau of the Census, the conference
explored the most important and fruitful research and development topics
within the user-oriented software domain. Its objectives were to
(1) develop recommendations on mechanisms to improve access to and use
of machine-readable Census Bureau data; (2) identify software systems
needed to assist the user community to more easily organize, tabulate,
and present census data; (3) review possible additional means for user
access to census data; (4) identify and recommend specific research and
development activities that would lead to improvements in the access to
and utilization of such data; and (5) develop specific recommendations
to the ASA for proceeding with an expansion of its program. This report
summarizes each day's session, as well as discussions and
recommendations of the conference groups and sub-groups. Appendices list
the participants, provide background and bibliographic material,
describe the conference agenda, contain the papers submitted, and offer
a Census Bureau view of the activities discussed by the participants.
(FM)

***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made      *
*   from the original document.                                       *
***********************************************************************






U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.






REPORT OF THE CONFERENCE ON
DEVELOPMENT OF USER-ORIENTED SOFTWARE

Old Town Holiday Inn
Alexandria, Virginia

November 8-10, 1977



"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY

Edgar M. Bisgyer



AMERICAN STATISTICAL ASSOCIATION
806 15TH STREET, N.W.
WASHINGTON, D.C. 20005



TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC)"



CONTENTS

I.   Introduction
       Background
       Purpose
       Participants
       Conference Format
       Organization of Report

II.  Opening of Conference and Presentation of Papers
       Opening of Conference
       Presentation of Papers

III. Summary of the Major Recommendations
       Institutional Recommendations
         Strengthening the Interface
         Serving the User Community
       Technical Recommendations
         Data Dictionaries
         Data Extraction
         Geographic Base Files and Other Geographic References
         Generalized Tabulation Systems
         Data Base Methodology
         Time Series
         Hardware
       Possible Areas for Future ASA/Census Cooperation

IV.  Group Discussions and Recommendations
       Data Organization Group
         Discussion
         Presentation of Recommendations
         Recommendations
           Technical Recommendations
           NSF/Census/ASA Research Programs for Fellows
       Data Tabulation Group
         Discussion
         Presentation of Recommendations
         Recommendations
       Data Presentation Group
         Discussion
           Education and Communication in the Area of Data
             Presentation
           Data Selection and Requests for Data
           Data Editing
           Color and Graphics
           User Interface and Service Organization
         Presentation of Recommendations
         Recommendations
           User Education
           Hardware
           Software
           Data Requirements
           Organization

V.   Discussion and Acceptance of Group Recommendations by
     the Conference
       Submission of Preliminary Group Recommendations
       Acceptance of Final Group Recommendations by the
         Conference

Appendix A. Names, Affiliations, Addresses and Background of
            Conference Participants

Appendix B. Final Program for Conference on User Oriented Software

Appendix C. The Organization, Tabulation and Presentation of Data
              State of the Art: an Overview by William T. Alsbrooks
                and James D. Foley
              Needs for and Availability of User Software to
                Process and Analyze Census Bureau Machine Readable
                Products by Warren G. Glimpse
              Census Software Needs of State and Local
                Governments by Harold B. King
              Business Use of Census Data by Richard B. Ellis
              Organization of Data: Considerations Relevant to the
                Development of User Oriented Software That Might
                Enhance the Utility of Data Generated by the Bureau
                of the Census by Mervin E. Muller
              Organization of Data for Census Users by Bruce
                Carmichael, Warren Besore and Kam Tse
              Generalized Statistical Tabulation by Hugh F. Brophy
              Generalized Tabulation Systems at the U.S. Census
                Bureau by Melroy Quasney
              Reference Materials Used by Robin Williams and
                Lawrence Cornish, Speakers at the Data Presentation
                Group
              Materials Prepared for Sub-group Discussions by
                Shirley Gilbert, Gary L. Hill and Rudolph C.
                Mendelssohn

Appendix D. Status Report on Selected Census Bureau Activities



REPORT OF THE CONFERENCE ON THE
DEVELOPMENT OF USER-ORIENTED SOFTWARE

I

INTRODUCTION

The Conference on the Development of User-Oriented Software was held at
Stouffer's National Center Hotel in Arlington, Virginia on November 8, 9 and
10, 1977.¹

Background

This conference is part of a 3-year program conducted by the American
Statistical Association (ASA) in cooperation with the Bureau of the Census,
and supported by the National Science Foundation and the Bureau of the Census.
Its purpose is to explore ways of improving the national data base through a
program of research at the forefront of statistical techniques applied to the
social sciences, and by supplementing and sharing with researchers in a large
data collection agency the experience of senior social scientists and the
training of graduate students in statistics, economics, demography, computer
science and related areas. The Conference on User-Oriented Software is one of
four projects being conducted under this program. The other projects are in
the research areas of (1) seasonal adjustment of economic time series, (2) edit
research of computer output and (3) the development of new population
projection methods for States and metropolitan areas.



Purpose

The conference sought the advice of experts outside the Census Bureau on
the most important and fruitful research and development topics within the
user-oriented software domain. Five specific objectives were posed:

   1. To develop recommendations on mechanisms to improve access to
      and use of machine-readable Census Bureau data, especially
      through the development of user-accessible software.

   2. To identify software systems needed to assist the user community
      to more easily organize, tabulate and present Census data.



¹The conference is supported by NSF grant #76-13271. The views and
recommendations herein are solely those developed by the conference and
not necessarily those of the National Science Foundation.



   3. To review possible additional means for user access to Census
      Bureau data other than the three identified software areas of
      data-base management systems, graphic systems and generalized
      tabulating systems.

   4. To identify and recommend specific research and development
      activities that would lead to improvements and simplifications
      in the access to and utilization of Census Bureau data.

   5. To develop specific recommendations to the ASA for proceeding
      with an expansion of its program.

Participants

Conference participants were selected and invited jointly by the ASA
and the Bureau (Appendix A). The selection process balanced participants by
professional backgrounds as well as by areas of application. They
included statisticians, demographers, computer scientists, sociologists,
geographers and others; their experiences ranged through business, government,
academic and research applications. Some 35 people attended from outside the
Bureau, with another 20 coming from inside the organization.

Conference Format

The format of the conference was organized around a view of generalized
software for the Census user in three parts: data organization, data
tabulation and data presentation (Appendix B). Data organization encompasses
public-use microdata files and summary files in terms of their preparation and
organization for better access by the general user. Data tabulation is, of
course, a large part of the special processing of Census files. Data
presentation is viewed as including microform output, graphics, mapping and all
types of publication-quality presentation forms.

The first day of the conference was devoted primarily to the presentation
of invited papers. The second day, the conference participants divided into
three groups under the headings given above of Data Organization, Data
Tabulation and Data Presentation; each group separately prepared recommendations
to be made to and from the whole conference. The third day was devoted to
the presentation, discussion and refinement of the recommendations.

Much of the original content planning for this conference was
accomplished by William Alsbrooks and Kam Tse of the Census Bureau; further
planning also included Bruce Carmichael, Lawrence Cornish, James Foley
and Melroy Quasney. Michael Garland, Warren Glimpse and Paul Zeisset, of the
Bureau's Data User Services Division, also made substantial contributions.
Daniel Relles, of the Rand Corporation, and George Heller, of the Census
Bureau, served as co-chairmen of the conference and were involved at all
stages, including responsibility for this report.

Organization of Report

Following this introductory section, the first day's session is summarized
in section II. It was not the intent of the conference to include a verbatim
transcript of all proceedings, although the formal papers and other materials
presented by the speakers the first day are reproduced or completely referenced
in Appendix C. Nevertheless, it is important that the reader be given some
sense of the range and spirit of the sub-group discussions the second day and
during the presentation of their recommendations to the plenary session for
review and perfecting the third day.

Accordingly, section IV summarizes for each sub-group highlights from its
day's discussion and formation and presentation of its recommendations. Section
V covers the discussion and acceptance of final recommendations on the third
day.

Section III summarizes the final recommendations of all three groups and
relates the conference's findings to the objectives posed at the beginning.

Appendix A lists the conference participants, providing appropriate
background and bibliographic material as well. Appendix B describes the
conference agenda. Appendix C contains the papers submitted by all of the
speakers and some of the participants. Appendix D is a "Status Report on
Selected Census Bureau Activities," to provide the reader with a Census Bureau
view of many of the activities discussed by the participants.

II

OPENING OF CONFERENCE AND PRESENTATION OF PAPERS

Opening

The conference was opened and a welcome extended by the Directors of the
American Statistical Association and the U.S. Bureau of the Census.

Fred Leone, Executive Director of ASA, traced the historical effort to
improve the social science data base, of which this conference of prime movers
in that field is but one facet. The dual purposes of the conference, he
explained, are directed toward developing and perfecting software to enhance
the use of Census Bureau and other data by the social sciences and to examine
future needs in terms of users' requirements and necessary research.

Manuel Plotkin, Director of the U.S. Bureau of the Census, noted that among
the Bureau's top goals and priorities are (1) the uses and applications of
data and (2) the improvement of data-processing systems to enhance timeliness,
accessibility and relevancy of the data. He called attention to the Bureau's
increasing workload and corresponding pressures: preparation for the 1980
census is under way, the 1977 Economic Censuses are about to be taken, there
will be a census of agriculture for 1978, the Current Population Survey is
being expanded, etc. There have been software meetings in the past, but this
conference is the first one in which there has been a joint meeting among data
users outside the Bureau and data users and computer hardware and software
staffs from within the Bureau. All have different perspectives to contribute.
In the area of generalized software, the Bureau hopes for some innovative
developments that will increase usefulness and productivity.

Presentation of Papers

Following the conference opening, ten speakers presented, in full or in
summary, papers which had been prepared and distributed to the participants
and are reproduced in Appendix C. The first four papers were designed to give
a general view of the state of the art, the need for, and availability of,
user software as seen by the Census Bureau and the uses and needs as seen by
other governments and in the private sector.

William Alsbrooks, who is in charge of the programming staff that develops
software for use within the Census Bureau, presented (in a paper written with
James Foley) an overview of the three topics to be addressed by the conference,
namely, data organization, tabulation and presentation.

Warren Glimpse, Assistant Chief of the Data User Services Division,
Bureau of the Census, focused on the supply of, and demand for, software for
improving data use. He reviewed the availability of machine-readable resources
and existing software. He emphasized that, while there are some unmet needs
for user software, there are many related requirements for effective use of
Census Bureau machine-readable products other than software. Major problems
in using these products are not only software, but also the file structure,
documentation, and archiving procedures followed by the Bureau, or the absence
of them.



Harold King, who directs computing services for the Urban Institute,
talked about the software needs of State and local governments. He observed
that in 1969 meetings with the Census Bureau to request software support for
users had negative results, but the Bureau's position has changed since that
time. There are approximately 38,000 general-purpose governments in the
United States, roughly 35,000 of which are small municipalities and townships
that need data to meet Federal grant-application and other requirements. It
must be assumed that many of these governments have little or no computer
capabilities, although there is a rapid expansion in the use of mini-computers.
Users still will need guidance on how to apply census data to local problems.

Richard Ellis, a marketing manager for the American Telephone and
Telegraph Company, emphasized in a review of his full paper the variety of
corporate uses of census data he had covered. These are good analogs for
general business usage of research information.

The last six papers of the opening day provided opportunity to hear from
a representative of the user or technical software community and a Census
Bureau speaker on each of the three topics the individual sub-groups would be
working on the second day. The first two speakers had prepared papers on the
organization of data.

Mervin Muller, Director of the Computing Activities Department of the
World Bank, posed a number of questions about data organization and outlined
research areas that would lead to fruitful discussion within and beyond the
time period of the conference.

Bruce Carmichael, leader of the Central Data Base Group at the Census
Bureau, discussed the importance of data organization and the need for more
sophisticated data schemata and accessibility, and stressed the Bureau's need
for users' help in this direction.

During the remainder of the first day the next two speakers addressed
statistical tabulation and the final two speakers, statistical presentation.

Hugh Brophy, Chief of the Systems Development and Programming Unit at
the United Nations Statistical Office, noted the magnitude of the processing
involved in a national census. The resultant information should be regarded
as a valuable national resource. In practice, there tends to be a loss of
information in summarizing the statistics, difficulty in linking with local
data, and great expense when special tabulations are required. Flexible
tabulation systems are a partial solution. He then summarized his paper,
dealing with generalized tabulation systems.

Melroy Quasney, of the Census Bureau's Systems Software Division,
said that the Bureau, in solving its own problems, hopes to supply users
with tools that may include a problem-oriented language.

Robin Williams, of the International Business Machines Corporation
(IBM), discussed a user-oriented systems approach for software and hardware
developed in IBM's Research Division. He illustrated with slides IBM's
Geodata Analysis and Display System (GADS): it builds and maintains files,
extracts data, and then analyzes and projects them in tabular or graphic
form on a color display. Topics suggested for further discussion were:
graphic-terminal functions research, software for interactive graphic-
terminal support, the feasibility of supplying data in the format and form
requested by users, and the provision of services to a requester, e.g.,
on-line query facilities, plotting facilities, etc., for census data.

Lawrence Cornish, of the Census Bureau's Systems Software Division,
pointed out that the classical approach to data publication is to deliver
them in non-machine-readable form. Using materials from an internal Census
Bureau study, he described the hardware now available for a wide variety of
alternative data delivery systems including graphics.

III

SUMMARY OF THE MAJOR RECOMMENDATIONS

Before going on to the discussions and detailed recommendations of the
three sub-groups of the conference, which are set out in section IV, it would
seem helpful to try to pull together and highlight the most important of
those recommendations. The reader, of course, is urged to consult section IV
for the full effect. Particularly noticeable at the onset is the high level
of overlap and concurrence in the three sets of recommendations, the more so
in view of the separation of the three groups when their recommendations were
being drafted and the three distinct software areas represented.

The recommendations of the Conference were far-ranging but certainly not
beyond the general guidelines set as objectives for the conference. In
covering the Census Bureau's user software, and the distribution of that
software, it was inevitable and natural to discuss the products and objects
of that software: the data. At times also it was necessary to discuss and
cover the associated areas of user documentation and user training. It could
not be expected that all of the participants would be fully cognizant of all
of the Bureau's efforts and plans in all of these areas. While an attempt
was made in some of the Bureau's first day presentations to present some of
this background, all of the relevant areas could not be anticipated. So, in
order to give the reader of this record a brief overview of these activities,
Appendix D has been included. This appendix is not an answer to the conference
recommendations or a solution to the problems raised, but is background
material that could have been provided prior to the conference.

As the general discussion of the third day showed, and as reflected in
the recommendations, there was a strong sense of concern by many attendees
that the Bureau would not be adequately prepared to meet user demands in the
1980's. Many felt that an examination of the entire data delivery system
was necessary, not just the software development component. To the extent
that these concerns are actual, the conference participants will await a
Bureau response; to the extent that these concerns represent a lack of
knowledge of the Bureau's activities, the participants will expect a better
educational effort by the Bureau.

The recommendations fall into three types: institutional, which involve
largely improved communications between users and the Census Bureau; technical,
which deal with the actual software development; and those particularly
appropriate as further ASA/Census endeavors.

Institutional Recommendations

Strengthening the Interface

The need to strengthen and broaden the interface between users of census
data and the Census Bureau suggests that:

User needs should be monitored.

   * An ongoing assessment of user needs for software should be conducted.

   * User comments and evaluations of software should be compiled.

   * A users' group on user software should be formed.

User education and training must be expanded.

   * Materials and training courses for user education should be developed.

   * User-oriented documentation and training material on data and
     software capabilities should be geared to various levels of technical
     and processing proficiency.

Serving the User Community

To better serve the user community, it should be determined which of
the following should be considered:

   * A national census data center.

   * A consortium of users.

   * A national network.

Technical Recommendations

Possible technical solutions to a wide variety of user problems merit
the examination of:

Data Dictionaries

Machine-readable data dictionaries must be distributed for each
distributed data file. These dictionaries must be accurate, up-to-date and
machine-portable. The dictionary should include definitions and common recodes
and provide easy mapping to data elements. The Census Bureau needs to work with
existing groups such as the Association of Public Data Users (APDU),
the Federal Statistical Users' Conference (FSUC), etc., that have already
addressed the subject of terminology, conventions and definitions, to ensure
that the data dictionaries are meaningful to users. The Census Bureau also
should provide as detailed information as possible on its data dictionary
plans to the user community as soon as possible. All software should be
developed, and users taught, to access data via a data dictionary, to remove
format dependencies from programs associated with reading census files.
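
As a rough illustration of what "removing format dependencies" could mean in practice, the sketch below reads a fixed-width record using only a machine-readable dictionary. The field names, column positions and code lists are invented for the example; they are assumptions, not the layout of any actual Census Bureau file.

```python
# Hypothetical data dictionary: every name, offset and code below is an
# assumption made for illustration, not an actual Census file layout.
DICTIONARY = {
    "STATE": {"start": 0, "width": 2, "label": "State code"},
    "TRACT": {"start": 2, "width": 6, "label": "Census tract"},
    "AGE": {"start": 8, "width": 3, "label": "Age in years"},
    "SEX": {"start": 11, "width": 1, "label": "Sex",
            "codes": {"1": "Male", "2": "Female"}},
}


def read_record(line, dictionary):
    """Decode one fixed-width record using only the dictionary."""
    record = {}
    for name, spec in dictionary.items():
        raw = line[spec["start"]:spec["start"] + spec["width"]].strip()
        # Apply the recode list when the dictionary supplies one.
        record[name] = spec.get("codes", {}).get(raw, raw)
    return record


print(read_record("51000100 342", DICTIONARY))
```

Because the program never mentions a column position itself, a change in file layout would require a new dictionary but no change to the reading code.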

Data Extraction

Efficient mechanisms and procedures should be established to extract
data for users and to manage the response to such requests. The Census Bureau
should support the development, with an eye to subsequent portability, of
generalized extraction software that will automatically provide a modified
data dictionary.
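
One way to picture extraction software that "automatically provides a modified data dictionary" is sketched below. The `extract` function and the field layout are hypothetical, chosen only to show the idea that the extract file and its dictionary are produced together.

```python
def extract(lines, dictionary, keep):
    """Return (new_lines, new_dictionary) holding only the fields in `keep`.

    The new dictionary describes the layout of the extract file, so the
    extract stays self-describing. Hypothetical sketch, not Bureau software.
    """
    new_dictionary, slices, offset = {}, [], 0
    for name in keep:
        spec = dictionary[name]
        slices.append((spec["start"], spec["start"] + spec["width"]))
        # Copy the field description, updating only its new position.
        new_dictionary[name] = {**spec, "start": offset}
        offset += spec["width"]
    new_lines = ["".join(line[a:b] for a, b in slices) for line in lines]
    return new_lines, new_dictionary


# Assumed source layout, invented for the example.
SOURCE = {
    "STATE": {"start": 0, "width": 2},
    "TRACT": {"start": 2, "width": 6},
    "AGE": {"start": 8, "width": 3},
    "SEX": {"start": 11, "width": 1},
}

records, layout = extract(["51000100 342", "51000200 071"],
                          SOURCE, ["STATE", "AGE"])
print(records)
print(layout)
```

The user receives a smaller file plus a dictionary that matches it, so programs written against the dictionary continue to work on the extract.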

Software should be developed and made available by the Census Bureau
for handling the most basic and simple types of data retrieval and presentation.

Research should be conducted to determine the special machine-readable
files (extract files) and extraction programs that should be produced for
special program compliance.

The extraction in machine-readable form of the full array of census
data aggregated according to user-defined geographic areas should correspond
to the full range of information now available for standard Census-defined
geographic areas.

• ■ » i . > • . > 

Geographic Base Files and Other Geographic References

User specification of tabulation areas in terms of coordinates should be
allowed. This would require a uniform high standard for coordinates in
geographic base files (GBF's), and GBF coordinates should be corrected
topologically and cartographically.

A machine-readable data base should be developed that defines changes
and equivalencies in statistical areas.

The Census Bureau should provide separate machine-readable files of
spatial definitions (e.g., polygonal coordinates or raster) for all statistical
areas.
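
To make the idea of a user-specified tabulation area concrete, the sketch below sums a count over small areas whose reference points fall inside a polygon given by coordinates. It uses the standard ray-casting point-in-polygon test; the block records, coordinates and populations are invented for illustration, and a real system would depend on the coordinate quality the recommendation calls for.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in order; points exactly on an
    edge are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def tabulate_area(blocks, polygon):
    """Sum population over blocks whose reference point is in the polygon."""
    return sum(b["population"] for b in blocks
               if point_in_polygon(b["x"], b["y"], polygon))


# Hypothetical block records and a user-drawn rectangular area.
blocks = [
    {"id": "A", "x": 1.0, "y": 1.0, "population": 120},
    {"id": "B", "x": 3.0, "y": 1.5, "population": 80},
    {"id": "C", "x": 5.0, "y": 5.0, "population": 200},
]
user_area = [(0, 0), (4, 0), (4, 3), (0, 3)]

print(tabulate_area(blocks, user_area))  # prints 200
```

Assigning each small area by a single reference point is a simplification; areas split by the user's boundary would need a rule of their own.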

Generalized Tabulation Systems

The tabulation group made recommendations for research, development and
general support in the area of generalized tabulation systems. While approving
the Census Bureau's efforts to elicit users' needs for this type of software,
they listed a number of areas that would require research before a system
could be put in place. For example, a generalized user system would have to
interface with data dictionary systems, and these dictionary systems have
not been defined for the Census users.



Data Base Methodology

As a vehicle for promoting research on advanced data base management
technology, it was recommended that an efficient access and transmission
system for user requests concerning specific places, types of persons and
characteristics be investigated. A capability to flexibly combine persons
into alternative social units was described as highly desirable and
technologically worthy of research.

Time Series

The data organization group recommended that the costs and benefits of
a time series capability be explored.

Hardware

The Census Bureau should investigate the potential role of minicomputers
and microcomputers for data portability and for access and analysis of census
data by users with limited resources.

Possible Areas for Future ASA/Census Cooperation

All three groups recommended National Science Foundation support for a
research fellow at the Census Bureau. One proposal involved basic research
projects and feasibility studies in data organization and delivery systems.
Another recommendation was that NSF support research and development into
efficient, effective and statistically useful techniques for the generation of
statistical tables.

IV

GROUP DISCUSSIONS AND RECOMMENDATIONS

As noted in the Introduction, it is not intended to reproduce in this
conference report a verbatim account of the proceedings of each of the three
sub-groups. What is attempted in the following pages is to give the reader a
feeling of the matters each group addressed, how they covered them during
their discussions and finally, in each group's own language, the
recommendations they agreed to and how they presented them to the full
conference on the third day.

Data Organization Group

Discussion

The data organization group began by pointing to certain areas that it
would like to cover and directions it might wish to take in developing its
findings and recommendations. Included in these were:

   * More flexibility in the organization of census data to
     accommodate the broad spectrum of user needs.

   * More detailed information and links between relevant data
     at a person or block level (base level).

   * Easier access and utilization of data; data should be
     made available more quickly to users who request it. Better
     documentation would reduce the amount of time spent interpreting
     census data.

   * Census data should be able to accommodate and be accessible
     to both sophisticated and unsophisticated users, or large vs.
     small organizations.

   * What the current state of the art is and what advances can
     be made based on technology available today.

10 . * 



data that would be available in different formats and still 

preserve confidentiality* The Bureau might provide detailed 

and specific information tailored to particular imbeds without 

violating disclosure rules* 

* Requires data from demographic, housing, and economic areas; uses
the 1970 summary tapes and would like more software for them. At
present, much tape handling is required before getting information;
very often has to regroup data. It would be helpful if the data
contained different codes for such things as school districts and
police precincts. Needs public-use data provided faster than it is.
SPSS is satisfactory for much of the analysis. Is unfamiliar with new
graphics developments and applications.

* Stressed concern for the unsophisticated users in small
organizations or small branches in large organizations, that require a
lot of assistance in utilizing census data. There should be some way
for a user to produce some quick exploratory work in only a day or two
of planning. Would like much faster access to data; delivery is slow.
Also would like more flexibility in data and approves of the notion of
a small common denominator. Often needs different sub-populations and
geographic boundaries for different purposes and has a problem with
Census's divisions. Complained about having to alter the existing data
too much to meet his specific needs.

* Would like to have the data available faster. Produces cost
estimates of legislation proposals and needs the best available data at
the current time. Has a limited amount of time and so has to focus
quickly. Perhaps an on-line system for non-programmers with easy access
to a big data base might be the answer. Another problem is trying to
locate the data. Suggested some sort of data library; perhaps another
on-line system which points to where the data could be found would be
helpful. Would like the Census Bureau to maintain its professional
integrity, as well as treat its users more equally. Recognized that
units of analysis are always changing and that constant updating is






* What can be accomplished by the time of the 1980 or the
1990 census; what research tools exist and can be immediately
utilized.

* The Bureau should be more concerned with user needs.

* The group should think about goals far in the future.

* How to approach idealistic goals with finite resources.

* What research should be done by Census, NSF, or others;
should currently fundable research projects be considered?

Following this general discussion, individual users in the group were
each given a brief time to explain their own interest in, needs for, and
concerns about data organization. Their responses can be summarized as
follows:

* Wants a more detailed public-use sample at smaller geographic
level, but does not need any more software. The Census Bureau should
not get involved in selling software; what seems most important is
getting results as quickly as possible. Also, the available data are
getting farther and farther away from what a city needs.

* Involved in planning for the 1980 census. Interested in data
organization suggestions, how the Bureau can satisfy its users fully
and what its job should be in research. Would like ideas for
improvement of the census internally as well as for services users
need externally.

* Works with population samples ranging anywhere from 2,000 to
1,500,000, in a planning capacity. Would like to see more consistency
in data and better documentation of census data and how they are
organized. Frustrated when determining the difference between census
first and fourth counts due to confusing documents. Would also like
greater distinctions in race, such as black vs. brown. Interested in
more detailed information at the census tract level.

* There is a barrier when dealing with the smallest geographic
common denominator; interested in county information but that is not
always the case. Called for a diversification of




required to keep abreast of what is going on.

* Does not want the Census Bureau to get into the software
business. Reorganization and greater consistency of data are
needed.

* Concerned about what kind of software is needed to help users.
Some of the Bureau's internal work might be useful to outsiders in
solving some of their problems. The Bureau has a number of areas where
improvement is needed, especially in its relationship to outside users.

* Interested in reorganization of data. Has problems with public
use and summary tapes of the nature discussed by other participants.

* Data content is insufficient. Would like to see the data
edited and documented more effectively, and users be advised promptly
of data changes. Would also like more group data and tapes available
on more of different structures by different characteristics and
areas. Has specific and varied interests, and structures are always
changing. Asks for more data and more flexibility in the data
available.

* Has problems converting data from non-machine-readable to
machine-readable form and would be interested in software that could
make this conversion. Sees the household as an important unit of
analysis; has a great need for more household data at many different
levels of geography.

* Users need specialized information for specific areas produced
by people qualified at manipulating gigantic data bases, and
flexibility that allows the aggregation of people and geographic
units. There is a need for greater detail at smaller area levels, plus
the ability to strip off specific things of interest from census data,
all as soon as possible. It takes too much time to wade through
unwanted information to extract needed data.

* There is a need for software research to provide flexibility of
data; a data base is an important step for this. Data base technology
is very important in relation to user-defined areas of analysis. The
Canadian approach is data bases with links, although sequential files
are lacking. There are microdata down to household and person levels,
and normalized rectangular files are produced for users. Some link
keys are provided that users can tie into. A lot of customized work
can be done that includes data provided by users.

* Much of the technology, e.g., geographic base files, UNIMATCH,
etc., already exists to solve the problems discussed; the Census Bureau
should avail itself of this technology in solving many of its users'
data problems. The time has come for the Census Bureau to get more
involved in distributing specialized data to its broad spectrum of
outside users. There is a need for data at the person-within-household
level, in which the person is the basic unit but has a link to his
household with some kind of identification of the type of household.
The structure of the 1970 housing file is unsatisfactory; the person
file and household file appear to have been done by two completely
different groups, i.e., more cohesion is needed. The Bureau should have
a data dictionary similar to one provided by the NSF.
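The person-within-household structure described in this comment can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions: the record layouts, field names (`household_id`, `household_type`, `tract`) and sample values are hypothetical and are not taken from any Census Bureau file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Household:
    household_id: str
    household_type: str  # hypothetical codes, e.g. "family" / "nonfamily"
    tract: str


@dataclass(frozen=True)
class Person:
    person_id: str
    household_id: str  # link key back to the household record
    age: int


def link_persons(persons, households):
    """Attach household type and tract to each person via the link key."""
    by_id = {h.household_id: h for h in households}
    linked = []
    for p in persons:
        h = by_id[p.household_id]
        linked.append({
            "person_id": p.person_id,
            "age": p.age,
            "household_type": h.household_type,
            "tract": h.tract,
        })
    return linked


# Hypothetical sample data: the person is the basic unit of analysis.
households = [Household("H1", "family", "T100"),
              Household("H2", "nonfamily", "T100")]
persons = [Person("P1", "H1", 34), Person("P2", "H1", 6),
           Person("P3", "H2", 71)]
linked = link_persons(persons, households)
```

Keeping the person as the record and carrying only a key to the household lets the same file be tabulated at either level without duplicating household data.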

* Users need flexibility of data, available quickly, aggregated
in a variety of ways, and in small and large groups. Users want to be
able to submit a request to the Census Bureau and get exactly what is
ordered. Cross-tabulations are fine, but availability and accessibility
are the keys. There is a need for rectangular files. Cross-case
analysis would be a useful tool.

* The Census Bureau must get into a data base system so it can
handle users' requests. In view of the significant time lag involved in
this process, perhaps there could be a public-access data base system
through which users could get directly at the data without having to go
through Census bureaucracy. The Bureau cannot presume to guess the
cross-tabulations that people need.

* A data base system should be subjected to cost/benefit
analysis, and the state-of-the-art in data base technology should be
examined. Data might be considered along with data organization. What
is the cost of multiple structures, and adding identifiers to data?
Should the Census Bureau transmit raw data or the results of
processing? (There are ways to measure this.) Should data be given on a
magnetic tape or perhaps over a network, such as telephone lines? Is
mylar tape wanted, and how would this be decided? What about modes of
storage? In any case, there are many things to consider before making
any significant changes. Better documentation is in order, perhaps in
the form of software, and data definitions for processing should be
included.


Presentation of Recommendations 

In presenting their recommendations to the full conference for review
and general approval, members of the data organization group pointed out
that the objectives set for discussion of software were flexibility,
accessibility, time dimension and modes of storage. The group did not
specify long-term or short-term goals, nor apply these measures to any
programs. Expense, difficulty and serious technical constraints are
involved in this area, but there should also be an awareness of the
research already reported in the literature. Further, the conceptual
differences among techniques need to be understood. There was a consensus
that there should be more work in the area of disclosure analysis to
determine how more data can be released and still maintain acceptable
levels of confidentiality.

Recommendations

Bureau of the Census data are an invaluable national resource. Our
recommendations are intended to achieve modern and efficient use of this
resource by the broad and varied spectrum of users dependent upon it.

There is a real concern that, failing aggressive and well planned
changes in the Bureau's perceived mission and procedures, there is a
significant risk that it will be unable to meet the obligations placed on
it in the 1980's. The specific areas of concern include:

* Incomplete knowledge of the needs of present and future
external users of census data.

* Lack of forceful developmental efforts to ensure that
state-of-the-art technology is brought to bear on meeting defined
user needs.

* Lack of a systematic delivery system geared to a diversity
of users with a wide range of technical and professional capabilities.

In order to serve these users, we therefore make a set of recommendations
including a set of technical innovations which would lead the Bureau of the
Census to take advantage of modern data organization techniques. We also
recommend establishment of an equally innovative institutional setting which
will insure access to Bureau of the Census data by all segments of society
requiring such use. The thrust of the technical recommendations detailed
below is toward greater usability of and access to the full complement of
Census Bureau materials. Although we fully recognize that no data
organization schemes, delivery systems, or presentation techniques can be
allowed to violate individual confidentiality statutes, we nevertheless
believe that current access to microdata can be greatly expanded while
protecting this confidentiality.

Further, we are aware that no existing security system is failsafe,
including the present one. However, careful security systems can be
constructed, while permitting greater access than is currently the case, to
the socially critical information contained in the data files within the
Census Bureau.

The thrust of the institutional recommendations was essentially:

* Monitoring user needs.

* Providing user training.

* Giving timely service.

* Pricing to support user access.

Whether organized inside or outside the Census Bureau, the institutional
setting might entail:

* A national census data center and/or

* A consortium of users and/or

* A national network.

Each of the above should be considered and justified in terms of cost and
the best ways to serve the user community.



Technical Recommendations

The Census Bureau should research alternatives so as to develop and
implement techniques and software to provide the following capabilities:

1. Flexible reconstitution of data about people into a variety of
significant social units, such as families, households, dwelling units, etc.
This will entail developing and retaining data that relate the person to the
designated social units. An example of one step in this direction is the
recent Bureau of Labor Statistics concept of a "person in family."

2. Extraction in machine-readable form of the full array of census
data aggregated according to user-defined geographic areas. This
data-extraction capability should correspond to the full range of
information now available for standard Census-defined geographic areas.

3. Efficient access to and transmission of selected user requests
concerning:

* Specific places.
* Specific types of people.
* Specific characteristics.

This will require that the Census Bureau aggressively promote research
on advanced data-base management technology.

4. Deployment of timely, accurate, portable, machine-readable data
directories.

5. Provision of user-oriented documentation and training material on
data and software capabilities geared to various levels of technical and
processing proficiency.

In addition, the Census Bureau should explore the costs and benefits of
developing and maintaining a time-series data capability on both a
forward-looking and an historical basis.
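Recommendation 2, extraction aggregated to user-defined geographic areas, amounts to summing small-area counts through a user-supplied lookup. The sketch below is illustrative only: the block identifiers, item names and the school-district assignment are hypothetical, and a real system would also have to apply disclosure rules to the resulting totals.

```python
from collections import defaultdict


def aggregate_to_user_areas(block_counts, user_area_of):
    """Sum small-area counts into areas defined by the user's lookup table."""
    totals = defaultdict(lambda: defaultdict(int))
    for block_id, counts in block_counts.items():
        area = user_area_of.get(block_id)
        if area is None:
            continue  # block lies outside every user-defined area
        for item, value in counts.items():
            totals[area][item] += value
    return {area: dict(items) for area, items in totals.items()}


# Hypothetical block-level counts and a user-supplied district lookup.
block_counts = {
    "B1": {"persons": 120, "housing_units": 40},
    "B2": {"persons": 80, "housing_units": 35},
    "B3": {"persons": 200, "housing_units": 90},
}
school_district_of = {"B1": "District A", "B2": "District A",
                      "B3": "District B"}
district_totals = aggregate_to_user_areas(block_counts, school_district_of)
```

The same function would serve police precincts or any other user boundary, provided the user can assign each small area to exactly one of their own areas.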


NSF/CB/ASA Research Programs for Fellows

Individuals should be assigned to explore technical as well as cost
benefits and alternatives for:

1. More advanced disclosure analysis techniques to allow larger volumes
of detailed public-access data.

2. Development of time-series data base capabilities.

3. Gathering and publishing information on present and projected Census
data use in order to determine alternative data organization strategies and
delivery systems.

This work should review previous Bureau data-use research, and recognize
that projected data use is influenced by present data organization
strategies.

Data Tabulation Group 

Discussion

In opening the discussion the group recalled the general goals set for
all the groups and stated them as:

Short-term: Role of the research fellow to visit the Bureau of the Census.
What would you have him do?

Long-term: What ought the NSF to fund in order to promote civilized
analyses of Bureau of the Census data? Are there enough research projects
having common needs that general software development will pay off?

It was noted that the responses should be based on what users (rather
than the Bureau) want to do, but the Census Bureau would have input to the
dialogue also.

Rudolph Mendelssohn, Assistant Commissioner of the Bureau of Labor
Statistics (BLS), discussed a paper he had prepared, and which had been
distributed, on his agency's experience with generalized tabulating systems.
They use TPL (table-producing language); it may be inefficient, but is widely
used in place of programmer resources. He felt that it is essential to
(1) identify the end use of the data, and (2) develop the necessary software.
The BLS writes the user manual first, then the language, then the routines.

Gary Hill, Director of Information Systems for CACI (Consolidated
Analysis Centers, Inc.), who had also prepared a brief paper for the group's
discussion, said that his firm has generalized information systems that
emphasize processing efficiency and has had favorable experience with data
base dictionaries and interrecord analysis. He noted the problems of
statistical accuracy inherent in correlation analysis, and suggested there be
research in correlating household and person variables.

In the discussion that followed, it was felt that the problem lies in
the basic statistical assumptions (interpretation of values), where the unit
is the same for the observer, but differs at various hierarchical levels.
Some statisticians are working in this area now. Several data users reviewed
their approaches to census data and how editing and extraction were carried
out to arrive at end products. Among the needs and individual
recommendations voiced were the following:

* Photocomposition software.

* Portability of data between software systems.

* Hard copy available at the local level (for small governments).

* Clear statements for the end users regarding the Census Bureau's
allocation, imputation, and suppression practices.

* Indication of inherent problems in the data or the software
(users often lack hardware/software compatibility).

* Cross-tabulations wider than the Bureau's printed output.

* Cross-tabulation in such a way that further work with the
data is possible.

* Attention to the microtechniques used in tabulation algorithms;
time vs. space tradeoffs.

* Make users aware of nonsampling error.

* Software packages through which the tables come close to
tabular analysis and capture multiple-regression coefficients.

* Tools that involve use by noncomputer specialists.

* Focus on types of software, and determine what can be done in
these five fields:

- Tabulation from basic records (use the Census Bureau's system
if it is quick and cheap).

- Provide a general tabulation system for the public-use sample.

- Make the basic record tapes available for tabulation.

- Computer mapping and charting.

- More sophisticated statistical analysis.

* Identify the user and the software available to him.

* Good, documented quality checks in software that has the
ability to check and impute.

* A table package for generating machine-readable files and dealing
with the missing data.

It was noted that tabulation definitions vary, and it was suggested that
it would be more appropriate to consider all functions in processing, such as
maintaining the universe, sampling, response control, editing and screening
returns, cross-tabulation, analysis, and presentation. This would allow
dealing with each more efficiently. Perhaps what is needed is that the
Bureau's tabulation be done in such a way that other things can be done with
the results.

The need for software to handle hierarchical files and the time-series
processing and analysis was noted, and it was also pointed out that the
National Bureau of Economic Research, the RAND Corporation and the
Massachusetts Institute of Technology all have software for the latter.

Users were asked to define what is "acceptable" data and how they
should be presented, and do the same thing for software. Should the Bureau
work on existing packages available outside and act as a clearing house for
them? Response could be that the Bureau simply should organize its data
in such a way that they can be used with existing packages or that the known
tabulation packages and their respective characteristics be listed. A
visiting research fellow might try to identify the commonalities or
uniquenesses of user needs so they could be linked with package capabilities;
he could help simplify the match and decide what training (if any) would be
needed so that the user would be best served. He also could identify what
the Bureau would need to do in providing the data. This could involve both
economic and demographic programming.

The discussion included the subject of installation and training needed
for systems, the costs involved and what happens when the user complains about
a system in place. Questions were raised: Does the discussion imply that
the NSF should stimulate the supply or the demand? Are there too many systems
and too few users for each? Should users be informed about the packages that
are available and their problems with them be investigated? Discussion then
turned to the Bureau's generalized tabulation system proposal, in which it was
cautioned that the Bureau should allow the system to evolve locally, and to
what a visiting fellow might do at the Census Bureau.

The question of the Bureau developing a data base dictionary that is
readable by various systems was raised. This dictionary would require
continual updating and the problem of how to make this automatic could be
addressed in a research project. These things are being done, but
recommendations are needed on exactly how. One suggestion was to put the
dictionary in codebook format and make it available for reading through
interface packages that users might have without having to use codebooks
themselves. Vendors will produce codebooks, but the Census Bureau should be
motivated to enhance the utility of its data. Should the Bureau take on the
task of making these data more usable or let the vendors do that, since the
Bureau has its own needs as well?

The group then turned to formulation of its recommendations, focusing
on (1) the areas for research and development needed to better satisfy
users' requirements, and (2) tools or access to tools for further use of
machine-readable data, either directly or through a distribution center.
It also was felt that it should be made possible for a user to designate a
submodel when suppression occurs. There was a discussion of suppression,
random rounding and "noise" injection, and there was sentiment in favor of
research for alternatives to all of these. It was felt that a system can be
devised that permits greater detail than is presently available and still
preserve confidentiality.
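Two of the disclosure-limitation techniques named in this discussion, suppression and random rounding, can be sketched briefly. This is a minimal illustration under assumptions that do not come from the report: the threshold of 5, the rounding base of 5 and the sample table are all hypothetical, and actual Census Bureau practice may differ.

```python
import random


def suppress_small_cells(table, threshold=5):
    """Replace nonzero counts below the threshold with None (suppressed)."""
    return {cell: (None if 0 < count < threshold else count)
            for cell, count in table.items()}


def random_round(count, base=5, rng=random):
    """Round to a multiple of base so that the expected value is unchanged."""
    remainder = count % base
    if remainder == 0:
        return count
    lower = count - remainder
    # Round up with probability remainder/base, otherwise round down.
    return lower + base if rng.random() < remainder / base else lower


# Hypothetical tract-by-age counts.
table = {
    ("tract 1", "age 65+"): 3,
    ("tract 1", "under 65"): 412,
    ("tract 2", "age 65+"): 0,
}
suppressed = suppress_small_cells(table)
rng = random.Random(0)  # seeded only to make the sketch repeatable
rounded = {cell: random_round(count, rng=rng) for cell, count in table.items()}
```

Suppression removes the small cell outright, while random rounding keeps every cell but blurs it; that trade-off between lost detail and added error is what the group wanted researched.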

There was general agreement that data should be as portable as possible,
and that there should be a machine-readable dictionary in well documented
format (e.g., compatible with SPSS) and well tied to the data elements. A
subset of the dictionary could be used for translation programs and a format
statement. There were differences of opinion as to whether it should be
possible to run this dictionary on all kinds of computers.
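The idea that one machine-readable dictionary can drive both record reading and a format statement can be sketched as follows. The dictionary layout, field names and sample record are hypothetical assumptions made for illustration; the FORTRAN-style format string is one possible rendering, not a format specified by the conference.

```python
# Hypothetical dictionary: one entry per data element on a fixed-width record.
DATA_DICTIONARY = [
    {"name": "state", "start": 0, "width": 2, "type": str,
     "label": "State code"},
    {"name": "tract", "start": 2, "width": 6, "type": str,
     "label": "Census tract"},
    {"name": "persons", "start": 8, "width": 6, "type": int,
     "label": "Total persons"},
]


def parse_record(line, dictionary):
    """Cut one fixed-width record into named, typed fields."""
    record = {}
    for entry in dictionary:
        raw = line[entry["start"]:entry["start"] + entry["width"]]
        record[entry["name"]] = entry["type"](raw.strip())
    return record


def format_statement(dictionary):
    """Render a FORTRAN-style format string from the same dictionary.

    Assumes the fields are contiguous and listed in record order.
    """
    parts = []
    for entry in dictionary:
        code = "I" if entry["type"] is int else "A"
        parts.append(f"{code}{entry['width']}")
    return "(" + ",".join(parts) + ")"


record = parse_record("11000100001234", DATA_DICTIONARY)
fmt = format_statement(DATA_DICTIONARY)
```

Because both outputs are derived from the same entries, an update to the dictionary propagates to every program that reads it, which is the portability argument made above.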

There also was disagreement as to whether the Census Bureau should
distribute generalized systems, because this might entail servicing them as
well. It was suggested, however, that the Bureau should create a system,
implement it and then consider the problem of distribution. If the Bureau
develops extraction software this should be made as portable as possible,
being written in ANSI COBOL or COBOL. The group thought that the Bureau
should develop a generalized extract program and a modified data dictionary
with an eye to their subsequent portability. It also should be able to
respond efficiently to demands for extracts. There was some dialogue over
the cost of a tabulation program equipped to do extract work, with estimates
running from $300,000 to $600,000. While this was deemed to be expensive,
the alternative might be anywhere from 500 to 5,000 Federal contracts in
various parts of the country that would have to include funds for independent
software for this purpose.




It was felt elsewhere that a Federal agency such as the Census Bureau
has an obligation to make its software known to the public, but that it
should not be in the software dissemination business. On the other hand, the
agency uses tax money to build a system for its own use, so the system ought
to be usable outside the agency for maximum cost benefit. There was no
agreement on this topic. One possibility is that vendors should be stimulated
to produce their own interfaces with an agency's system. Somewhere, however,
there should be an effort to bridge the gap between a Census Bureau system
and local users.

This discussion led to tentative recommendations that there be an
investigation of the need for software to transform data for use in a
generalized tabulation system, and of the need for corresponding
dictionaries. It also was suggested that the Bureau generate various recodes
of the items in its delivered tapes; this would avoid repetitive recodes that
might be reflected in the dictionary. The recodes and associated headings and
stubs could be supplied in the dictionary, together with a hierarchical key
understandable to the system. This is partially available in the START
system, but not in UNIVAC. One member recommended that the Bureau proceed to
make generalized tabulation software available to users, either in the form
of access or programs with support. A visiting fellow might be asked to
assess the demand for such software or at least evaluate the potential.
Several participants called for documentation of this software so that users
could implement it without difficulty. There was some disagreement as to
whether the Bureau would be obligated to document beyond its own needs for
internal use.
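The suggestion that recodes and their stub labels travel with the dictionary can be illustrated with a small sketch. The age groupings, labels and sample ages below are hypothetical, as is the "Not reported" fallback; they stand in for whatever recodes the Bureau might actually supply.

```python
from collections import Counter

# Hypothetical recode supplied with the dictionary: (low, high, stub label).
AGE_RECODE = [
    (0, 17, "Under 18"),
    (18, 64, "18 to 64"),
    (65, 120, "65 and over"),
]


def recode(value, scheme):
    """Return the stub label whose range contains the value."""
    for low, high, label in scheme:
        if low <= value <= high:
            return label
    return "Not reported"  # assumed fallback for out-of-range values


ages = [4, 17, 18, 40, 65, 90]
stub_counts = Counter(recode(age, AGE_RECODE) for age in ages)
```

If the recode table is shipped once with the data, each user no longer has to rebuild the same groupings and labels independently.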

Possibly if four or five heavy, knowledgeable users of census data
jointly advised the Bureau on the development of usable extract and other
programs, the NSF might be interested in underwriting some of the group
costs. There were divergent opinions on this, but a consensus that someone
should make this possible.

It was suggested that generalized tabulation software in the Bureau
should be developed with an eye toward it becoming part of the public domain,
and the group was told that this is one of the Bureau's objectives, given
input from users as to the directions such software might take. There was a
feeling that the Bureau should make a greater effort toward this end, and
that users should be assured that there are adequate resources for providing
the detail they need once the software is available, e.g., output tables for
further analysis, additional computations (medians, order statistics, etc.),
and the capability to handle as input records that require file manipulation.

There was a discussion of how all this could be brought about, and it
was suggested that the NSF might take up the issue with an ongoing
organization such as the Association of Public Data Users (APDU). This could
be a vehicle for interaction with users concerning the tables required to
meet their needs. It was suggested, on the other hand, that the Bureau
already has channels for such dialogue. One proposal was that the NSF might
make it possible for users to spend time at the Census Bureau, so that they
and the Bureau staff would have a better grasp of each other's operations.

Looking toward 1980, the initial investment in generalized systems
would be very great; unless the files are made available in more usable forms
than they were for 1970, vendors would hesitate to fill gaps between the
Census product and user capabilities. Might the NSF establish and support
an activity that would ensure adequate planning and appropriate allocation
of funds to obviate these gaps? The activity might be lodged in the APDU
to ensure wider involvement. There was a discussion of whether the APDU is
capable of such a function.

A possible general recommendation that would take into account exploding
technology and the need for technology for minicomputers was discussed
briefly.

Presentation of Recommendations 



In presenting his group's recommendations to the conference, the group
chairman stated that they had rejected a comparative evaluation of tabulation
systems because the variables--bounds, environment, objectives, equipment,
etc.--are too great. It was felt that there is a residual gap between the
development of needs in the market and of services in the Census Bureau; this
gap merits further investigation. It would be valuable for users to visit
the Bureau for short periods, and vice versa, to go through a variety of work
using census data; further, there should be interchange involving such
organizations as the APDU to try to solve data problems.



Recommendations

The group had discussed the Census Bureau's plans in the field of
generalized statistical tabulation. There was a strong feeling among
users outside the Bureau that the situation with respect to availability
of generalized software and data (other than published tables) was likely to
be little better than the most unsatisfactory situation which obtained
in the past. Special mention was made of the need to improve services and
products of the 1980 Decennial Census compared to those of 1970.

In the short term (for the next 3-4 years), the group appeals to the
Bureau to maximize its efforts to respond to user needs with respect to
machine-readable data and appropriate tabulation software. Failure to do
this will lead to continued problems such as those that existed after the
1970 census--namely, continued parallel and redundant efforts by many users
(often supported by Federal funds) to overcome deficiencies, loss of
information, failure to use information, etc.

Special mention was made of machine-readable data dictionaries, which
this group felt to be of fundamental importance, especially for the 1980
census. The group requests that the Bureau work with existing groups such
as the Association of Public Data Users (APDU), the Federal Statistical
Users' Conference (FSUC), IASSIST, etc., that have already addressed the
subject of terminology, conventions and definitions, in order to ensure the
data dictionaries are meaningful to users. The Bureau should also provide
as detailed information as possible on its own data dictionary plans to the
user community as soon as possible.

For the longer term (1980 and beyond), the users among the group agreed
to work through their professional organizations to bring the needs of the
user community to the highest possible forum. It was felt that the U.S.
Congress must improve its perception of the value of Census data.

The group recommends that the Bureau continue its efforts to close the
gap between supply and demand for Census products (other than published data)
in order to avoid the problems outlined above.

This sub-group recommends that the NSF support an investigation into
ensuring the adequacy of planning and allocation of appropriate resources to
meet identified user needs.






Further specific topics and conclusions of the sub-group are as follows: 

Impact on Bureau of the Census Summary Tabulation Plans for
Proposals to Meet User Needs

* Ensure that general tabulation software provides tabulations
needed, i.e., no information is lost in the treatment of suppressed
data (privacy versus maximizing information at detailed geographic
levels); all information necessary for subsequent analysis, including
(a) output tables for further analysis (provided in useful formats),
(b) capability for additional computations developed while tabulating
(medians, order statistics, etc.), and (c) capability to handle (as
input) records that require manipulation.

Data Portability

* Produce a machine-readable data dictionary that includes
recodes, definitions, etc., and provides easy mapping to data
elements.

* Ensure efficient and effective management of updates to the
data dictionary and of its distribution to users.

* Support the development, with an eye to subsequent
portability, of generalized extraction software that will provide
automatically a modified data dictionary.

* Investigate the need for software to transform data and
create dictionaries to use generalized tabulation systems.

* The Bureau of the Census should generate various recodes
of items in delivered tapes to avoid repetitive recoding (needs
to be reflected in the dictionary).

* Efficient mechanisms and procedures should be established
to extract data for users and to manage the response to such requests.

* Minicomputer applications should be considered in planning
for data portability.

Modification of Generalized Tabulation Software Development
Toward Eventual Dissemination To and Use In the Public Domain

* The group applauds Census Bureau plans to elicit information
on the needs for features and documentation to facilitate this, but
strongly confirms its recommendation that the NSF support an
investigation into ensuring the adequacy of planning and the allocation
of appropriate resources to meet identified needs.


The Group Requests the NSF to Support Research and Development
Into Efficient and Effective Techniques for the Generation of
Statistical Tables

* Such research ought to consider what statistics (e.g., cell
medians, quartiles, etc.) can be easily computed along with the
tables to give a more complete description of the data's patterns.
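
As an illustration of the kind of computation meant here, the following is a minimal sketch, not a description of any Census Bureau program. It assumes records arrive as simple dictionaries and that the field names (`tract`, `tenure`, `income`) are hypothetical. Each table cell keeps its raw values so that the cell count, median and quartiles can be reported together.

```python
from collections import defaultdict
from statistics import median, quantiles


def tabulate(records, row_key, col_key, value_key):
    """Build a two-way table whose cells carry count, median and quartiles."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec[row_key], rec[col_key])].append(rec[value_key])

    table = {}
    for cell, values in cells.items():
        if len(values) > 1:
            # quantiles() needs at least two data points
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
        else:
            q1 = q3 = values[0]
        table[cell] = {
            "count": len(values),
            "median": median(values),
            "q1": q1,
            "q3": q3,
        }
    return table


# Hypothetical input records; field names are assumptions for illustration.
records = [
    {"tract": "A", "tenure": "owner", "income": 10},
    {"tract": "A", "tenure": "owner", "income": 20},
    {"tract": "A", "tenure": "owner", "income": 30},
    {"tract": "A", "tenure": "owner", "income": 40},
    {"tract": "A", "tenure": "renter", "income": 50},
]
table = tabulate(records, "tract", "tenure", "income")
```

Holding every value in memory is the simplest choice for a sketch; a production tabulator working on large files would more likely use sorted passes or approximate order statistics.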

Data Presentation Group

There are two distinct areas of data presentation, the first dealing
with machine-readable forms such as tapes and the second, the noncomputer-
readable final product such as microfiche, film and paper-copy graphic
displays. There is a need to focus on users' requirements for census data
as well as on software that should be developed. It was determined by the
group that software for data presentation falls into three categories:
routines that produce graphics, those which organize the data, and routines
that prepare data for graphics. Sophisticated software already exists to
produce graphics but is needed in the remaining categories.

Education and Communication in the Area of Data Presentation 

A lack of education and/or communication with respect to the area of
data presentation is a major problem. In the discussion it was noted that
data presentation is not a visual process alone, but that an understanding
of the data needs to be included. Footnotes and explanations that accompany
visual material tend to be shortcut. One hazard noted was that the printed
report is an excellent means of promoting an understanding of data, but
that it is ignored when it accompanies graphic material. Computerized
documentation is a partial solution to the problem, but often users will
ignore a more detailed printed report in favor of condensed, automated
documentation. In the absence of documentation, users interpret graphic
output as they see it. A well organized, readable book might be sponsored,
showing a broad spectrum of Census data uses; perhaps a comic book and/or
film approach would be appropriate. Interaction and involvement were cited
as good vehicles for education, and perhaps the concept of the Census Bureau's
DIME workshops could apply to the area of the use of census data.

Various methods of computer-assisted education and communication were
discussed. Microfiche could be produced at a central facility and distributed
among users. The benefits of microfiche include low cost and easy accessi-
bility. The use of data machines with a CRT (cathode ray tube) and cassette
capability as a means of disseminating information was suggested. These
units are inexpensive and have the benefit of analysis as well as display
functions. An interactive system that could lead the user through requests
for Census data as well as provide educational facilities was suggested.
Problems with the interactive system approach include greater expense, limited
accessibility, and a reluctance on the part of State and local users to fund
timesharing rather than a capital investment.

The Census Bureau might well take advantage of the motivation that
exists at local levels to aid in the implementation of an educational process.
The Bureau could supply educational support to a State that commits itself
to the program and the State then would be responsible for the distribution of
information to local users.

Data Selection and Requests for Data

The areas of data selection, presentation and education are inseparable,
as shown by two different directions that the data presentation process takes
as a result of a lack of knowledge. It was noted that the uneducated user
often requests a "dump" of all available data in a rough form in order to
determine which subset of the data to focus upon. Once the subset has
been determined, the user then requests more sophisticated displays. The
other extreme is the user who initially requests a small subset of data to
be presented, only to learn that more is available, resulting in further
requests. Education as to the availability of data and the means of presenta-
tion would offer a partial solution to the problem.

Several methods were suggested to aid in the selection process of the
subset of data to be presented; one was that software should be developed to
select subsets of data. Problems with this include hardware limitations of
some users and the expense involved in developing and implementing a software
solution. Microfiche was suggested by another participant as a possible
alternative in light of the expansion of microfiche capabilities. Data from
summary tapes could be stored on microfiche, enabling a user to select from
the available data. A participant suggested that a Regional processing
center could exist with the hardware and software necessary to provide data
to the community.

It was observed that the selection process controls the level of
presentation and also the analysis that can be performed on the data. Data
possibly should be presented without analysis, leaving that for the user to
do.

Another problem seen is in the timing of requests for data. Following
some Federal announcements, many requests were received. Software could be
developed to facilitate the handling of data requests, which would also
avoid duplication of effort in the case of commonly used reports. An inter-
active system could supply the requested data. It was suggested that an
area of research might include defining the classes of commonly used data
and also the means of their presentation.

Another means of facilitating the processing of data requests might be
to sponsor a legislative analyst at the Census Bureau who would be responsible
for surveying all legislation and guidelines pertaining to data requests by
users. He could also determine the Federal programs that the user might
qualify for.

Data Editing 

There were several complaints about the lack of software in the area of
data editing, i.e., getting the data in a format that is useful for users'
purposes. A relationship needs to exist between graphic packages and a data
base management system, which would facilitate the use of the existing graphic
software. One area of research could be the problem of organizing large
amounts of data for graphic presentations.

Different data areas by Census and the user are a major problem. One
application that was mentioned was that of forecasting future equipment and
manpower needs, or demand for a product. This requires the ability to overlay
Census and user data, and the procedure is very difficult when the two data
areas overlap. Perhaps a smaller census data tabulation unit could be
determined which would allow users to aggregate Census data up to their
particular data area. It was noted, however, that a trade-off must be made
between more data for large areas and less data for smaller areas. The
smaller the area, the greater the occurrence of suppressions to avoid
disclosure.
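
The trade-off described above can be sketched in a few lines. This is an illustration only: the unit identifiers, the unit-to-area lookup, and the use of `None` as a suppression marker are all assumptions, not a Census file convention. Small tabulation units are summed into a user-defined area, and the number of suppressed units is carried along so the user can see how incomplete each total is.

```python
SUPPRESSED = None  # hypothetical marker for a count withheld to avoid disclosure


def aggregate(unit_counts, unit_to_area):
    """Sum small-unit counts into user areas, tracking suppressed units."""
    totals = {}
    for unit, count in unit_counts.items():
        area = unit_to_area[unit]
        entry = totals.setdefault(area, {"total": 0, "suppressed_units": 0})
        if count is SUPPRESSED:
            # The true value is unknown; the area total is a lower bound.
            entry["suppressed_units"] += 1
        else:
            entry["total"] += count
    return totals


# Hypothetical block-level counts and a user's own area definitions.
unit_counts = {"b1": 120, "b2": SUPPRESSED, "b3": 80, "b4": 45}
unit_to_area = {"b1": "north", "b2": "north", "b3": "south", "b4": "south"}
areas = aggregate(unit_counts, unit_to_area)
```

The smaller the tabulation unit, the more often the suppressed-unit count will be non-zero, which is the cost the group noted.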

The problem of differing data areas is further compounded by the poor
coordinate quality found in Census files. The user typically must convert
Census DIME files to user polygonal-area files. Several participants com-
plained that the coordinates found in the Census GBF/DIME files are very
inconsistent and that a good coordinate system is one of their functional
requirements. It was stated that the process with which coordinates are
edited at Census is too cumbersome to be practical, and that the Bureau
lacks incentive in this area because coordinates are not used in its own
applications of DIME files. Research exists in this area; the Arithmicon
system, presently in the research stage at the Census Bureau, provides an
interactive capability for editing and maintaining DIME files.

It was suggested that research should be conducted in the area of
Census data presentation form. A different form might result in easier
conversion to user data areas. Raster form was discussed as a possible
alternative, as that field is rapidly expanding. Valid areas of research
would be to investigate the level at which Census should distribute data in
raster form, as well as raster vs. polygon vs. DIME forms for distributing
data. Data files could exist at different levels, perhaps at as many as
five. It was mentioned that perhaps the Bureau should not get involved in
the area of providing data for areas other than an agreed-upon unit of issue.

Color and Graphics

Graphics are the final end product for many data requests and are a
very popular means of presenting data. Although the group agreed that
sophisticated software already exists to produce graphics, it was suggested
that research needs to be conducted in this area. One participant suggested
research into the most frequently requested types of graphs and visual
presentations. Another suggested research to determine which subsets of
data should be graphically presented.

The concept of color with respect to data presentation was discussed
and research was suggested in this area as well. Research might include
experimenting with color and making comparisons to determine what is most
effective. Everyone has a different concept of color; the same color can
imply different meaning to different people. Another noted that users often
state exactly which colors they want in their presentations. It was pointed
out that quantitative scale mapping is not adapted to color. A participant
felt that the advertising field has already performed much research in the
area of color, and perhaps what is needed is research of research.

User Interface and Service Organization 

Many participants expressed desires for an automated user interface to
ease the process of presenting Census data. An interface is needed between
local, State and Census data. The need for development and marketing re-
search in the area of a common user interface was discussed. Such an
interface would ease the problem of using Census data for the nonsophisticated
user. It was suggested that the appropriate level of standardization, as
against the resulting limit on flexibility, should be investigated. A user
interface with a query capability would link Census data in a raw
form, subsetting and aggregation routines, and graphic-analysis routines.
It was recommended that the development of user software should be keyed to
a data dictionary, which would enable it to be flexible in case of format
changes. User software should be machine-independent.

The idea of a service organization to provide software services and a
user interface was discussed. The service organization would be responsible
for distributing data in various forms that would facilitate matters for
users of Census data. Listings of software applicable to the use of Census
data could be maintained by the organization, in order to refer users to
appropriate consultations. It was questioned whose responsibility such
an organization would be, government or industry. Concern was expressed
that perhaps government might be interfering with private industry in this
area. There was some feeling that the Census Bureau's first obligation is
to provide data and that software development must be at least secondary.

Presentation of Recommendations 

The group asked that consideration be given to instructing local users
how to cope with Federal program applications that require census data for
small areas. If all software were to access data via simple dictionaries or
more complex data base management systems, there would be far-reaching
effects on software development. Also stressed was the fact that training
and education are major requirements for effective use of census data and
for the development of useful, user-oriented software.



Recommendations

Needs of users and alternative modes of presentation are both extremely
diverse. Some can be directly addressed by short-term recommendations for
user-oriented software, while others require longer-term efforts in which
information must be gathered before software recommendations can be formulated.
The Data Presentation Group considered a broad range of possibilities, and
its recommendations reflect concerns shared by the other two groups. The
overall theme is flexible and effective public access to census data. We
have identified two major areas in which public access can be facilitated:
user education and technological improvements. Under these major topics we
have listed a number of specific gaps or omissions to be dealt with. We also
feel strongly that the technical program should be integrated with the
communication program, and that the integration of specific technical acti-
vities is essential to the objective of facilitating public access.


User Education 

1. Materials (various multi-media forms) should be developed for the
purpose of educating users and communicating the use of Census data.
Training courses should be developed involving computer-assisted
instruction, movies, videotape, programmed learning texts and case studies.

2. We recommend that the research fellow be a trainer to develop a 
specific training program for census data use (technical and professional). 
See recommendation 3. 

3. Investigation should be done on users' needs and desires for output
media, in order to determine products (e.g., slides, paper maps) to be
produced.

4. Research should be encouraged in display techniques (e.g., color)
for quantitative information.

Hardware 

1. Research should be conducted on the potential of new processing
technology (e.g., terminal access and mini- and micro-computers) in the
analysis of census data by users with limited resources, and the implica-
tions of that potential on prospective Census data-documentation techniques.




Software

1. All software developed for users should access data via a data
dictionary to remove format dependencies from programs associated with
reading census files.

2. Software should be developed and made available by the Census
Bureau for handling the most basic and simple types of data retrieval and
presentation.

3. Software should be developed to present data about change through
time. (A data base should be developed which defines changes and equiva-
lencies in statistical areas.)

4. The software developed by the Census Bureau for its processing
should be documented, and also made portable and available where feasible.

5. Geographic base files should be developed to facilitate time-series
analysis of small-area data and to allow direct access to census data via
independent geographic coordinates.

6. Research should be conducted to determine the special machine-
readable files (extract files) and extraction programs that should be
produced for special program compliance.
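
Recommendation 1 above can be illustrated with a minimal sketch. It assumes fixed-width records and a dictionary that maps each data element to its column positions; the two layouts and the element names are hypothetical, not actual summary tape formats. The point is that the tabulating function never mentions a column position, so a change of file format requires only a new dictionary.

```python
# Hypothetical dictionaries: element name -> (start, end) column positions.
DICTIONARY_V1 = {"state": (0, 2), "tract": (2, 8), "population": (8, 14)}
DICTIONARY_V2 = {"tract": (0, 6), "state": (6, 8), "population": (8, 16)}


def read_record(line, dictionary):
    """Extract named data elements from one fixed-width record."""
    return {name: line[start:end].strip()
            for name, (start, end) in dictionary.items()}


def total_population(lines, dictionary):
    """Format-independent tabulation: refers to elements by name only."""
    return sum(int(read_record(line, dictionary)["population"])
               for line in lines)


# The same data laid out in two different (made-up) record formats.
file_v1 = ["36000100001234", "36000200000766"]
file_v2 = ["0001003600001234", "0002003600000766"]

total_v1 = total_population(file_v1, DICTIONARY_V1)
total_v2 = total_population(file_v2, DICTIONARY_V2)
```

A machine-readable dictionary distributed with each file, as the Data Portability proposals suggest, would play the role of `DICTIONARY_V1` and `DICTIONARY_V2` here.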

Data Requirements (Geographic)

1. Higher standards are required for coordinates in geographic base
files (GBF's) in order to allow user specification of tabulation areas in
terms of coordinates. Specifically, GBF coordinates should be corrected
topologically and cartographically.

2. A machine-readable data base should be developed which defines
changes and equivalencies in statistical areas.

3. The Census Bureau should provide separate machine-readable files 
of spatial definitions (e.g., polygonal coordinates or raster) for all 
statistical areas. 
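
One way to picture the equivalency data base in recommendation 2 is as a table of old-area to new-area links. The sketch below is an assumption-laden illustration: the area codes are invented, and the idea of attaching an allocation share to each link is one possible design, not something the group specified. It restates counts tabulated for earlier areas in terms of later areas so that change through time can be compared on a common geography.

```python
# Hypothetical equivalency table:
# (old area, new area, share of the old area's count assigned to the new area)
EQUIVALENCIES = [
    ("T100", "T100.01", 0.5),  # tract T100 split into two tracts
    ("T100", "T100.02", 0.5),
    ("T200", "T200", 1.0),     # tract T200 unchanged
]


def restate(old_counts, equivalencies):
    """Re-express counts for old statistical areas on the new area definitions."""
    new_counts = {}
    for old_area, new_area, share in equivalencies:
        allocated = old_counts.get(old_area, 0) * share
        new_counts[new_area] = new_counts.get(new_area, 0.0) + allocated
    return new_counts


old_counts = {"T100": 4000, "T200": 1500}
restated = restate(old_counts, EQUIVALENCIES)
```

Fixed shares are a crude approximation; a real equivalency file might instead record only which areas correspond and leave the allocation method to the user.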


Organization 

1. Investigate the possibility of a user clearinghouse(s) for the
availability and development of user software. Set up a clearinghouse
for user software and investigate the possibility of developing and supporting
user software.

An ongoing assessment of user needs for software should be conducted.
Compile user comments and evaluations of software, and form a users' group on
user software.

2. We support the concept of summary tape data processing centers.


DISCUSSION AND ACCEPTANCE OF RECOMMENDATIONS BY THE CONFERENCE

Submission of Preliminary Group Recommendations to the Conference
The third day of the conference began with the submission of the preli-
minary group recommendations to the plenary session. During the opening
discussion concern was expressed that the Census Bureau still is using 1950's
techniques and needs modernization. Some portion of the members wanted to say
that the Census Bureau is "in trouble," and that the cost to catch up, in the
face of political and social needs, is increasing rapidly. These needs cannot
be met with current technology.

There was a discussion of the respective responsibilities of users and
the Bureau with respect to filling the technological gaps foreseen. The con-
sensus appeared to be that the average user needs to be trained to use the
tools at hand, and that the Bureau, as it develops techniques and software,
should constantly recognize users' needs and abilities to keep pace.

It was observed that the Bureau plans to replace its hardware completely
by 1982, and this hardware will be geared to data base management systems.
The Bureau would like users to spell out in detail what their data needs are
so that the Bureau's specifications can match them.

It was agreed that it would be helpful to recommend the first explicit
step(s) to the NSF, and the groups returned to their individual sessions
for further considerations.

Acceptance of Final Group Recommendations by the Conference
Upon completing the additional deliberations by individual groups, each
group's final recommendations were read and discussed by the conference as a
whole. Some language was modified to reflect consensus positions, and the
approved texts appear above in Section IV. The deliberations in the final
individual group sessions were not reported; only the plenary discussion
is summarized below.

Comment was made that what users tend to do is limited by the technology
available. The history of extensive analysis that led to research and
development in the Census Bureau during the 1960's and still conducted by
its Center for Census Use Studies was cited, but note was taken that some
projects that should have been carried forward were not.

There was a discussion as to whether the recommendations should be
time-oriented. It was felt that the conference may have the 1980 Decennial
Census in mind, whereas there are economic censuses, surveys and other
statistical programs being carried out in other years. It was agreed that
"short-term" might be interpreted as 3 to 4 years, but with emphasis on 1980.

The question was raised whether this conference or the presentation group
might be the beginning of a user group to address in more detail the various
items suggested. Another suggestion was the establishment of a clearing-
house to follow up on the conference agenda items, noting that the Census
Bureau, its oversight committee in Congress, and the Office of Management
and Budget are only some of the "actors" involved. Perhaps there might be
a follow-up conference in a year or two. The review process that has been
set up for the results of the four joint projects of the ASA, the Census
Bureau, and the NSF was mentioned, and also that there will be general
meetings with the Census Advisory Committee of the ASA. It was noted that
there will be efforts to formalize user support as much as possible and the
report of this conference will be given wide circulation. An offer was made
to monitor progress a year from now and report through a user journal.

It was suggested that a good use of the conference resources would be
to look at the purpose, process and impact of the 1980 census data products
and software on data processors. Training modules may be needed for various
user groups, together with data and use guides.

A question was raised as to whether the Bureau would feel the conference's
attitudes are unjustified or distorted, and whether the Bureau is worried about
its software products and their distribution. In reply it was stated that
discussion from all standpoints is being encouraged. The Bureau will receive
the recommendations and be glad to state what is being or can be done to carry
them out. Another participant felt that the conference is supportive of
improvements. It would be helpful, however, for the Bureau to tell how it
will use the conference information and what it is doing.



The following resolution was then passed: 

"The conference expressed its desire that the Bureau of
the Census be asked to advise participants through the
American Statistical Association of its plans to respond
to the various recommendations contained in the report
of the proceedings of the conference."






APPENDIX A 

NAMES, AFFILIATIONS, ADDRESSES, AND BACKGROUND OF CONFERENCE PARTICIPANTS

WILLIAM T. ALSBROOKS, Assistant Division Chief, Systems Software Division, U.S.
Bureau of the Census, Washington, D.C. 20233. M.S. (Computer Science),
Purdue University, 1970. Formerly Programming Branch Chief of Statistical
Methods Division of the Census Bureau.

MICHAEL J. BATUTIS, JR., Principal Demographer, New York State Economic Develop-
ment Board, P.O. Box 7027 - AESOB, Albany, New York 12225. M.A., Duke
University, 1972. Has served as demographer with New York State since Duke.

PATRICIA C. BECKER, Head of Data Coordination Division, Planning Department, City
of Detroit, 801 City-County Building, Detroit, Michigan 48226. M.S.
(Sociology), University of Wisconsin, 1964. Before going to Detroit in 1968
did academic survey research at the universities of Michigan, Wisconsin and
California (Berkeley).

JOHN BERESFORD, President, DUALabs, 1601 N. Kent Street, Arlington, Virginia 22209.
M.A., University of Michigan 1952. After military service he was with the
Bureau of the Census until founding DUALabs in 1969. He is presently
Chairman of the Association of Public Data Users Census Committee.

WILLIAM M. BRELSFORD, Supervisor, Statistical Computing and Methodology Group,
Bell Laboratories, Holmdel, New Jersey 07733. PhD (Statistics), Johns
Hopkins University, 1967.

HUGH FRANCIS BROPHY, Chief, Systems Development and Programming Unit, United
Nations Statistical Office, Room 3114 United Nations Plaza, New York, New
York 10017. B.Ec. (Hons), Australia National University, 1965. Held
Deputy Director of Computer Services and other posts with Bureau of
Statistics, Australia, and was Project Manager of a computing research centre
in Czechoslovakia.

LARRY CARBAUGH, Data Users Service Division, Room 3624 - FB #3, U.S. Bureau of
the Census, Washington, D.C. 20233. B.S. Duke University, 1964.

BRUCE CARMICHAEL, Group Leader, Central Data Base Group, U.S. Bureau of the
Census, Room 1373 - FB #3, Washington, D.C. 20233. PhD (Computer Science)
University of Maryland, 1976. Consultant to General Electric Space Flight
Division, systems analyst at NIMH and technical staff member at Bell
Telephone Laboratories.

WILLIAM S. CLEVELAND, Member Technical Staff, Bell Telephone Laboratories, 600
Mountain Avenue, Murray Hill, New Jersey 07974. PhD (Statistics), Yale
University, 1969. Assistant Professor, University of North Carolina
(Chapel Hill) before joining Bell Laboratories.

LAWRENCE E. CORNISH, Chief, Graphics Software Branch, U.S. Bureau of the Census,
Room 1529 - FB #3, Washington, D.C. 20233. Michigan and Michigan State
Universities.

JACK DANGERMOND, Director, Environmental Systems Research Institute, 380 New York
St., Redlands, California 92373. MLA, Harvard University, 1969. MA (Urban
Design), University of Minnesota. Was a teaching research associate at
Universities of Minnesota and Harvard and served as project manager with
Scientific Systems, Inc. and as director of the Environmental Systems
Research Institute.

PETER DICKINSON, Director, Data Processing, Center for Demography, University of
Wisconsin, 1180 Observatory Drive, Madison, Wisconsin 53706. MA (Sociology)
University of Wisconsin 1975. Was programmer analyst with the Center for
Demography and photogrammetric surveyor with the U.S. Forest Service.

RICHARD^.. ELLIS, Marketing Manager, Information, American Telephone & Telegraph
Co., 295 N. Maple Avenue, Basking Ridge, New Jersey 07920. B.A., Hamilton
College 1950. Held other marketing positions with AT&T and was supervisor,
Corporate Staff, with the New York Telephone Company.

CARL E. FERGUSON, JR., Director, Center for Business and Economic Research,
Box University of Alabama, University, Alabama 35486. PhD, University
of Missouri 1975. Before coming to Alabama as Assistant Director of the
Center was Assistant Director of the Public Affairs Information Service,
University of Missouri.

LAWRENCE FINNEGAN, Data Users Service Division, Room 3069 - FB #3, U.S. Bureau
of the Census, Washington, D.C. 20233.

JAMES FOLEY, Associate Professor of Electrical Engineering and Computer Science,
George Washington University, Washington, D.C. 20052. PhD, University of
Michigan 1969. Was assistant professor at University of North Carolina,
and with the Graphics Software Branch of the Census Bureau.

WILLIAM^. FREUND, Leader, Systems and Programming Groups, Data Services Center,
ERS, U.S. Department of Agriculture, Room W QHI Buildings, Washington, D.C.
20250. University of North Carolina, 19&3. Has held a variety of
positions in economic analysis and systems design with the Department of
Agriculture after graduation from North Carolina.

SHIRLEY GILBERT, Consultant and data analyst, Princeton-Rutgers Census Data
Project, Princeton University, 87 Prospect Avenue, Princeton, New Jersey
08540. M.A., University of Oregon, 19^6. Was an instructor in mathematics
at New Jersey College for Women (Rutgers) and University of Oregon.

WARREN GLIMPSE, Data Users Service Division, Room 3069 - FB #3, U.S. Bureau of
the Census, Washington, D.C. 20233. B.S., University of Missouri, 1969.
Was Director of Public Affairs and taught at Missouri. Consultant to in-
dustry and government on software design and evaluation.

SCOTT B. GUTHERY, Principal Software Engineer, Mathematica, P.O. Box 2392,
Princeton, New Jersey 08540. PhD, Michigan State University 1969. Worked
previously in applied statistics and data base management system research
with Bell Laboratories.

ROBERT D. HARRIS, Deputy Assistant Director, Congressional Budget Office, 2nd and
D Street, S.W., Washington, D.C. 20515. B.S., Ohio State University 1960.
Prior to joining the Congressional Budget Office was Chief of Information
Services with the Office of Management and Budget and held a number of posts
in the Department of Agriculture.

GEORGE M. HELLER (Conference Co-Chairman), Principal Researcher, Statistical
Research Division, U.S. Bureau of the Census, Washington, D.C. 20233.
M.A., Columbia University 19^9. Has held a variety of positions with the
Bureau of the Census since coming there from Columbia.

GARY L. HILL, Director, Information Systems Department, CACI, Inc., 1815 N. Ft.
Myer Drive, Arlington, Virginia 22209. MBA, Indiana University 1961. Has
been an officer of Data Use and Access Laboratories, Computer Resources
Corporation and project manager at IBM.

DAVID C. HOAGLIN, Senior Analyst, Abt Associates, Inc. and Research Associate
in Statistics, Harvard University, 55 Wheeler Street, Cambridge, Massachusetts
02138. PhD, Princeton University 1971. Has been on the faculty at Harvard
since 1971 and also served as senior research associate at NBER Computer
Center for Economics and Management Service.

HAROLD B. KING, Director, Computing Services, The Urban Institute, 2100 M Street,
N.W., Washington, D.C. 20037. B.A. (Mathematics), San Jose State, California
1959. Helped to establish the Association of Public Data Users and was
with the Interuniversity Communications Council.



FRED C. LEONE, Executive Director, American Statistical Association, 806 15th
Street, N.W., Washington, D.C. 20005. PhD (mathematics and statistics),
Purdue University 19^9. Taught at Iowa, University of California (Berkeley)
and Case Institute of Technology. Visiting professor at University of
Sao Paulo, Brazil and was on Ford Foundation Education Team in Mexico.

RICHARD G. MAYNARD, Acting Manager, Policy Support and Special Studies Division,
House Information Systems, 36**1 H0BA #2, Washington, D.C. 20512. M.A.
(Economics), University of Pennsylvania 1969. Was with EDP Technology,
Inc. and the Department of Defense.

MARK D. MENCHIK, The Rand Corporation, Santa Monica, California 90406. PhD
(Regional Science) University of Pennsylvania 1970. Was with New York
City-Rand Institute and taught in the geography department at the University
of Wisconsin.







RUDOLPH C. MENDELSSOHN, Assistant Commissioner, Bureau of Labor Statistics,
Room 20**7, 441 G Street, N.W., Washington, D.C. 20212. A.B., University
of Chicago 1938. Prior to becoming Assistant Commissioner in 1967 was in
charge of various Bureau employment, hours and earnings statistics. Edited
the Bureau's journal in that field.

JULES MERSEL, Senior Operations Research Analyst, Community Development Depart-
ment, City of Los Angeles, 200 N. Main Street, Room l4o*f, Los Angeles,
California 90012. M.S. (Physics), University of California (Berkeley)
1951. Was with the National Bureau of Standards and has had a broad range
of computer consulting positions in private industry.

PETER A. MORRISON, Member, Senior Research Staff, The Rand Corporation, 1700
Main Street, Santa Monica, California 90406. PhD, Brown University 1967.
Formerly assistant professor at the University of Pennsylvania and a special
consultant to the National Commission on Population Growth and the American
Future.

MERVIN E. MULLER, Director, Computing Activities Department, The World Bank,
1818 H Street, N.W., Washington, D.C. 20433. PhD (Mathematics), University
of California, Los Angeles 195^. Taught and was Director of the Computing
Center at the University of Wisconsin, managed Project WELD at IBM and has
been on the faculty at Princeton, Cornell and the University of California.

DAVID M. NELSON, Acting Program Director, Computer Information Systems, *H5
Coffey Hall, University of Minnesota, St. Paul, Minnesota 55^1^. PhD
(Economics and Statistics), Kansas State University, 1968. Has been a
visiting professor at Boise State University and Hamline University.






NORMAN H. NIE, President, SPSS, Inc., Suite 1236, 111 East Wacker Drive, Chicago,
Illinois 60601. Currently Senior Study Director, National Opinion Research
Center and Professor at University of Chicago. Was Senior Fulbright Fellow,
University of Leiden, The Netherlands and Woodrow Wilson Fellow, Stanford
University. Principal investigator for a number of political science projects.

MANUEL D. PLOTKIN, Director, U.S. Bureau of the Census, Washington, D.C. 20233.
M.B.A. (Statistics), University of Chicago 19^9. Came to his present
position from the corporate headquarters of Sears, Roebuck and Company
where he was Associate Director, Corporate Planning and Research. Managed
the Economic and Market Research Department of Sears and also served as
Chief Economist. Was earlier with the U.S. Bureau of Labor Statistics in
the Chicago and Washington offices and taught in the evening division of
several Chicago colleges.

JOE W. PYLE, Director of Physical Planning and Development, Houston-Galveston
Area Council, 3701 W. Alabama, Suite 200, Houston, Texas 77027. PhD,
University of Houston 1973. Previously held positions with Boeing Company,
Philco-Ford Corporation and the University of Houston.

MELROY QUASNEY, Systems Software Division, Room 1061 - FB #3, U.S. Bureau of
the Census, Washington, D.C. 20233.

LAWRENCE C. RAFSKY, Statistician, Chase Manhattan Bank, 18th Floor, 1 Chase
Manhattan Plaza, New York, N.Y. 10015. PhD (Statistics), Yale University
1974. Formerly at Bell Telephone Laboratories.

DANIEL A. RELLES (Conference Co-Chairman), Statistician, Rand Corporation, 1700
Main Street, Santa Monica, California 90406. PhD (Statistics), Yale
University 1968. Was a member of the technical staff of Bell Telephone
Laboratories.

ALBERT H. ROSENTHAL, Rand Corporation, 1700 Main Street, Santa Monica,
California 90406. With Rand since 1953. Currently Senior Analyst.

ALFRED J. TELLA, Special Adviser, Office of the Director, U.S. Bureau of the
Census, Washington, D.C. 20233. M.B.A., New York University 1959. Has
been Research Professor of Economics, Georgetown University and Director,
Office of Labor Force Studies, The President's Commission on Income
Maintenance Programs.

ANTHONY G. TURNER, Mathematical Statistician and Census Coordinator for ASA/
Census Research Program, U.S. Bureau of the Census, Washington, D.C. 20233.
B.S. and graduate work, University of North Carolina. Has been sampling
consultant to FDA and Population Research Council and was with the
Statistics Division of LEA. Served in Census previously as Chief of the
Special Surveys Branch.

MEL TURNER, Assistant Director, DBMS, Systems Development Division, Statistics
Canada, 12-P, R.H. Coats Building, Ottawa, Canada K1A 0T6. B.Sc. (Hons)
(Physics), Queen Mary College, University of London 1966. Has been in
several programming posts with both Statistics Canada and IBM (UK), Ltd.



HARVEY WEINSTEIN, SPSS, Inc., Suite 1236, 111 East Wacker Drive, Chicago,
Illinois 60601.

FORREST B. WILLIAMS, Manager, Marketing and Information Systems Group, CACI,
Inc., 1815 N. Fort Myer Drive, Arlington, Virginia 22209. PhD (Geography),
Ohio State University 1975. Has been a research analyst with the Census
Processing Center, Battelle-Columbus Laboratories and Special Projects
Manager for the Behavioral Sciences Laboratory at Ohio State.

ROBIN WILLIAMS, Manager, Display Systems Architecture, IBM, K 5/f - 282, 5600
Cottle Road, San Jose, California 95193. PhD, New York University 1971.
Worked in optical character and memory systems with Philips research
laboratories in England and Briarcliff Manor, New York. Taught at New
York University.

PAUL T. ZEISSET, Chief, Data Access and Use Staff, Data User Services Division,
Room 35^0 - FB #3, U.S. Bureau of the Census, Washington, D.C. 20233.
M.A., University of Texas 1969. Has been with the Data Access and Use
Staff since college.



Management of the conference has been under the direction of John W.
Lehman, ASA Conferences Director, with the assistance of Barbara ^indell.
Additional services have been provided by the ASA office. The conference
was reported by Fred Bohme of the History Staff of the U.S. Bureau of the
Census, assisted by Cynthia Agard and Patricia Griffin. Anthony Turner
served as Census coordinator for the program.






APPENDIX B 



FINAL PROGRAM FOR CONFERENCE ON
DEVELOPMENT OF USER ORIENTED SOFTWARE

Stouffer's National Center Hotel
Arlington, Virginia

November 8, 9, 10, 1977 

TUESDAY, NOVEMBER 8, 1977

8:00 - 9:00     Registration

9:00 - 9:30     Welcome and Introduction                     (Potomac Room)

                FRED C. LEONE, Executive Director,
                American Statistical Association
                MANUEL D. PLOTKIN, Director,
                U.S. Bureau of the Census

9:30 - 10:15    Overview of software state-of-the-art in information
                delivery

                WILLIAM ALSBROOKS, Systems Software Division,
                U.S. Bureau of the Census

10:15 - 10:30   Break

10:30 - 11:15   Current plans and activities of Census Data Users Division

                WARREN GLIMPSE, Data User Services Division,
                U.S. Bureau of the Census

11:15 - 12:00   Needs of users from the viewpoint of local governments
                and other public agencies

                HAROLD KING, Urban Institute

12:00 - 1:15    Lunch                                (Charleston Port Room)

1:15 - 2:00     Needs of users from the viewpoint of         (Potomac Room)
                economists, market researchers and
                others in the private sector

                RICHARD ELLIS, Market Research, American Telephone &
                Telegraph Co.

Organization of Data

2:00 - 2:30     Summary of user paper and questions

                MERVIN MULLER, World Bank

2:30 - 3:00     Summary of Census Bureau paper and questions

                BRUCE CARMICHAEL, Systems Software Division,
                U.S. Bureau of the Census

3:00 - 3:15     Break


Tabulation of Data

3:15 - 3:45     Summary of user paper and questions

                HUGH BROPHY, U.N. Secretariat

3:45 - 4:15     Summary of Census Bureau paper and questions

                MELROY QUASNEY, Systems Software Division,
                U.S. Bureau of the Census

Presentation of Data

4:15 - 4:45     Summary of user paper and questions

                ROBIN WILLIAMS, IBM Corporation

4:45 - 5:15     Summary of Census Bureau paper and questions

                LAWRENCE CORNISH, Systems Software Division,
                U.S. Bureau of the Census

6:00 - 7:00     Reception                            (Charleston Port Room)

8:30            Dinner                               (Charleston Port Room)

WEDNESDAY, NOVEMBER 9, 1977



7:00 



Simultaneous sessions by the Organization (Room 204), Tabulation (Room 110),
and Presentation (Room 104) sub-groups according to the following schedule:

9:00 - 10:15    Opening statements without interruption

10:15 - 10:30   Break

10:30 - 12:00   Discussion of invited papers and opening statements

12:00 - 1:30    Lunch                                        (Dewey I Room)

1:30 - 3:00     Proposing and discussion of recommendations

3:00 - 3:15     Break

3:15 - 5:00     Completing recommendations for submission to the
                full Conference

THURSDAY, NOVEMBER 10, 1977

(Resume full Conference)

9:00 - 9:30     Submission of Organization sub-group         (Potomac Room)
                recommendations to full Conference.
                Discussion

9:30 - 10:00    Submission of Tabulation sub-group recommendations
                to full Conference. Discussion

10:00 - 10:30   Submission of Presentation sub-group recommendations
                to full Conference. Discussion

10:30 - 10:45   Break

10:45 - 12:00   Individual sub-group meetings to review
                any proposed changes and prepare final
                recommendations *

12:00 - 1:30    Lunch                                          (James Room)

1:30 - 4:00     Acceptance of final recommendations          (Potomac Room)
                from sub-groups by full Conference



* For this period, the Tabulation (Room 110) and Presentation (Room 104)
sub-groups will meet in the same rooms they used on Wednesday. The
Organization sub-group will stay at the front of the Potomac Room.






APPENDIX C 

The Organization, Tabulation and Presentation of Data
State of the Art: An Overview

William T. Alsbrooks              James D. Foley
Bureau of the Census              George Washington University

Introduction

The purpose of this paper is to survey the state of the
art, from both a hardware and software technology point
of view, of the technical and delivery capabilities for

     Data Organization,

     Data Tabulation, and

     Data Presentation.

These areas are central to improving access to and use
of machine readable Census Bureau data. In the area of
data organization, we will talk about the state of the
art in Data Base Management Systems (DBMS); in the area
of data tabulation, we will talk about the state of the
art in Generalized Table Generator Systems; and in the
area of data presentation, we will talk about the state
of the art in Photocomposition and Computer Graphics.

The sections that follow examine functional capabilities
of each of the three individual components; the integration
of the three components into a total system; and the delivery
of the system capabilities to the end user.




2.0 Functional Capabilities

2.1 Data Organization

The term "database" can be viewed from many different
vantage points: its access, purpose, description,
content and integration. But all definitions seem to
contain three essential and practical characteristics -

     An organized, integrated collection of data.

     A representation of the data which is natural
     and convenient for users, with few restrictions
     or modifications imposed to suit the computer.

     Capable of use by all relevant applications
     without duplication of data.

A data base management system (DBMS) is simply the software
that supports such a database. The purpose of a DBMS is to
allow users to deal directly with data and relations of data
rather than be concerned with sometimes complex storage
structures.




As summarized by (PALM 75), the facilities that a DBMS can be
expected to provide are:

1)   The controlled integration of data to avoid the
     inefficiency and inconsistency of duplicated data.

2)   The separation of physical data storage from the
     application logic using the data to aid flexibility
     and ease of change in a dynamic environment.

3)   A single control of all data permitting
     controlled concurrent use by a number of
     independent on-line users.

4)   Provision for complex file structures
     and access paths such that relevant
     relationships between data units can be
     readily expressed and data can be
     retrieved most efficiently for a variety
     of applications.

5)   Generalized facilities for the rapid
     storage, modification, reorganization,
     analysis and retrieval of data so that
     the use of a database system imposes
     no restrictions upon the user.

6)   Security controls to prevent unauthorized
     access to specific units of data, types
     of data or combinations of data.

7)   Integrity controls to prevent misuse or
     corruption of stored data, and facilities
     to provide complete reconstruction in
     the event of hardware or software failure.

8)   Performance both in a batch mode and
     on-line that is consistent, measurable, and
     capable of being optimized.

9)   Compatibility with major programming
     languages, existing source programs, a
     variety of hardware systems and operating
     systems, and data external to the database.

Figures 1, 2, and 3 summarize the capabilities of various
DBMS's.

The data base approach is more than merely a different
computer technique involving the storage of data and the
use of additional generalized software. It involves a
new approach to designing and operating information systems
and has far-reaching effects well beyond the data processing
activities. Data base is a philosophy that regards data
as a resource to be managed just as other resources of the
organization are managed.

Described in terms of the CODASYL model, this is accomplished
by defining to the DBMS, through the facilities of a Data
Definitional Language (DDL), the structure and format of
data in the data base, the names and descriptions of the
data, relationships among units of data, and the methods
of access to the data. This definition of the data base is
called the schema. Data requirements of applications programs
are also defined using the DDL and are called subschema.
This can be thought of as the user's view of the data base.

Operations of retrieval, modification, storage and deletion
of data are accomplished through a Data Manipulation Language
(DML).



The DBMS is directly responsible for the physical placement
of data on the storage devices. A Device Media Control
Language is used by the system programmer to determine:

1)   choice of device by data type

2)   physical block size

3)   record placement

4)   overflow strategy.

Fundamentally there are only two ways of accessing data from
the mass storage device. Either the physical address is
known so that it can be retrieved directly, or if not known,
the relevant part of the data base must be searched. The
fundamental physical structuring alternatives are quite
limited, although they can be combined in a myriad of ways.

The simplest is sequential, where the next record required
is the next record on the file; it is defined by its position,
and its address is of no consequence. Records can be chained
together, with the address of the next record in the current
record.

Hashing and indexing are both techniques which allow direct
access to the desired record, in some cases with just a
single access to the file.
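
The difference between searching and direct access can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "file" is an in-memory list, and the names `bucket_of`, `build_hash_file` and `build_index` are hypothetical, not taken from any DBMS discussed in this paper.

```python
def bucket_of(key, n_buckets):
    """Toy hash: turn the key itself into a bucket address."""
    return sum(ord(ch) for ch in key) % n_buckets


def sequential_find(file, key):
    """Search: read records in stored order; return (value, records read)."""
    for reads, (k, value) in enumerate(file, start=1):
        if k == key:
            return value, reads
    return None, len(file)


def build_hash_file(file, n_buckets):
    """Place each record in the bucket computed from its key."""
    buckets = [[] for _ in range(n_buckets)]
    for k, value in file:
        buckets[bucket_of(k, n_buckets)].append((k, value))
    return buckets


def hashed_find(buckets, key):
    """Direct access: go straight to one bucket, then check its few records."""
    for k, value in buckets[bucket_of(key, len(buckets))]:
        if k == key:
            return value
    return None


def build_index(file):
    """Index: a separate table from key to record position in the file."""
    return {k: position for position, (k, _) in enumerate(file)}


def indexed_find(file, index, key):
    """Direct access: look up the position, then read that one record."""
    position = index.get(key)
    return None if position is None else file[position][1]


# Hypothetical three-record file, stored out of key order.
records = [("K3", "c"), ("K1", "a"), ("K2", "b")]
```

The sequential search reads every record ahead of the one wanted, while the hashed and indexed lookups touch only the bucket or position derived from the key.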

The basic physical access methods available to a database
system are limited and do not, of themselves, provide the
necessary complex file structures. Instead these are
implemented by the use of logical structures defined in
the schema and interpreted by the system software in terms
of the basic structures. Logical data structures can first
be classified as any of the following:

1)   Simple: All units of data are independent and
     of logically equal significance. They can be
     either ordered or unordered.

2)   Hierarchic: Units of data are dependent and
     can be logically arranged in a hierarchy of
     levels in which units have a single owner
     and/or own one or more other units. A
     hierarchical file is always ordered.

3)   Network: Units of data are dependent, but in
     a more complex structure than in a hierarchy,
     in which units have more than one owner, as
     well as own one or more other units.



A variety of file organizations are supported by database
management systems for both simple and hierarchical structures.
These can be thought of as second-level, or logical, structures,
since each corresponds to combinations or extensions of the
fundamental physical structures. Such organizations include
indexed, inverted, multilist, ring, tree, and network structures.

These logical data structures are then used to implement the
data models supported by various DBMS's.



A hierarchical data model is a collection of trees in which
the nodes are the record occurrences - in other words, a
one-to-many relationship.

This data model can be used in two ways:

1)   The selection criteria can be specified as
     a path through the tree. Some or all of
     the records along the path are the desired
     records. Example - IMS (IBM).

2)   The selection criteria are specified
     independently of the tree structure. The
     tree is then searched through the facilities
     of an inverted index for the desired
     records. Example - System 2000 (MRI).

The principal disadvantage of this type of model is that
it is often inadequate to accurately model the data. An
example of its weakness is its inability to model a
geographic lattice. Also, the tree structure makes many
retrievals difficult. If, however, a hierarchy is an
accurate data model and if most accesses can be expressed
as straightforward tree searches, it can be very efficient.
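
The two ways of using a hierarchy can be contrasted with a small Python sketch. It is not how IMS or System 2000 work internally; the tree of place names, and the functions `select_by_path` and `build_inverted_index`, are hypothetical and chosen only to show selection by path versus selection through an index built independently of the path.

```python
# Hypothetical hierarchy: nation -> state -> county, one record per node.
tree = {
    "name": "US", "children": [
        {"name": "Virginia", "children": [
            {"name": "Arlington", "children": []},
            {"name": "Fairfax", "children": []}]},
        {"name": "Maryland", "children": [
            {"name": "Montgomery", "children": []}]},
    ],
}


def select_by_path(node, path):
    """Way 1: follow a path of names from the root; return the records on it."""
    if not path or node["name"] != path[0]:
        return []
    if len(path) == 1:
        return [node["name"]]
    for child in node["children"]:
        below = select_by_path(child, path[1:])
        if below:
            return [node["name"]] + below
    return []  # the rest of the path does not exist under this node


def build_inverted_index(node, ancestors=(), index=None):
    """Way 2: map each record name to the full path that reaches it."""
    index = {} if index is None else index
    here = ancestors + (node["name"],)
    index[node["name"]] = here
    for child in node["children"]:
        build_inverted_index(child, here, index)
    return index
```

With the index, a request for "Montgomery" needs no knowledge of which state owns it; with the path form, the caller must already know the route through the tree.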

The network data model allows for many-to-many non-hierarchical
relationships. The best known of the network
systems are those based on the CODASYL (CODA 71) reports.

Superimposed on a variety of physical storage structures
is a logical structure called a set-ring structure, which
links record occurrences. An owner record can have many
members. A member record can be associated with many
owners in different sets. The primary advantage of the
network is that a wide variety of physical and logical
structures are provided and they can model most collections
of data very well. There are many choices to allow for
optimizing performance. There are also disadvantages, for
with all the alternatives come the complications. A
network model is very complex, and a user must know a great
deal about the actual storage structure to program efficiently.

Examples - DMS 1100 (UNIVAC)
           IDS/II (HONEYWELL)
           IDMS (CULLINANE)
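
The owner and member relationships described above can be sketched in Python as follows. This is a minimal illustration, not the CODASYL specification: the class `SetType` and its methods are hypothetical names, and the rule that a member has at most one owner within a given set type is stated here as an assumption of the sketch.

```python
from collections import defaultdict


class SetType:
    """One named set type linking owner records to member records."""

    def __init__(self, name):
        self.name = name
        self.members_of = defaultdict(list)  # owner -> members, in insertion order
        self.owner_of = {}                   # member -> its one owner in this set type

    def connect(self, owner, member):
        # Assumption of this sketch: one owner per member within a set type.
        if member in self.owner_of:
            raise ValueError(f"{member} already has an owner in set {self.name}")
        self.owner_of[member] = owner
        self.members_of[owner].append(member)

    def ring(self, owner):
        """Walk the ring: the owner, each member in turn, then back to the owner."""
        return [owner] + self.members_of[owner] + [owner]
```

A census tract could then be a member of one set owned by its county and of another set owned by its school district, which is the kind of non-hierarchical overlap a single tree cannot express.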

The relational data model is an approach developed largely
in the IBM Research Laboratories at San Jose, California.
The most significant papers have been by E. Codd (CODD 70).
The original motivation for this approach was the need for
data independence and the need to identify inconsistencies
in the database, but it soon became apparent that the
relational model, because of its basic simplicity, could
well provide a unifying structure for the design of any
[...] with which to
design a schema and need not be concerned with the
complexity of linkages, networks, repeating groups and
indexes.

The relational model is a mathematical approach built
around two basic concepts. The logical storage structure
used is a relation in third normal form, which is a type
of relation with the optimal properties for use in a
database.

All data in the relational model is viewed logically as a
simple table. This is easily understood by the layman and
is suited for display on terminals. Mathematically these
tables are known as relations. A relation of degree 'n'
has the following properties:

1)   it contains 'n' columns (known as domains);

2)   all elements in a given domain are of
     the same type;

3)   each row represents an n-tuple of the
     relation and contains 'n' elements;

4)   the ordering of rows is immaterial;

5)   all rows are distinct (there are no
     duplicate tuples); and

6)   columns (domains) are assigned distinct
     names.
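
The six properties can be checked mechanically. The Python sketch below is one way to do so, assuming a table is given as a list of domain names plus a list of rows; the function names `is_relation` and `same_relation` are hypothetical.

```python
def is_relation(domain_names, rows):
    """Check properties 1, 2, 3, 5 and 6 for a table of degree n."""
    n = len(domain_names)
    if len(set(domain_names)) != n:              # property 6: distinct names
        return False
    if any(len(row) != n for row in rows):       # properties 1 and 3: n elements
        return False
    if len(set(map(tuple, rows))) != len(rows):  # property 5: no duplicate tuples
        return False
    # Property 2: every element of a given domain has the same type.
    return all(len({type(row[i]) for row in rows}) <= 1 for i in range(n))


def same_relation(rows_a, rows_b):
    """Property 4: row order is immaterial, so compare as sets of tuples."""
    return set(map(tuple, rows_a)) == set(map(tuple, rows_b))
```

Treating the rows as a set is what makes ordering immaterial and duplicates impossible, which is why properties 4 and 5 fall out of a single representation choice.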





In conventional terms, a relation can best be equated to
a serial file containing one record type of fixed length.
Thus, a tuple is equivalent to a record; a domain, to all
data-items of a particular type in the file.

Tuples are identified by their keys, which are formed from
a combination of one or more elements. A tuple can contain
more than one combination of elements that uniquely defines
it. Each combination is termed a candidate key; the one
arbitrarily selected to identify the tuple is its primary
key.
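
As an illustration of candidate keys, the following Python sketch lists every minimal combination of domains whose values are unique across the rows supplied. It is only a sketch: in practice a key is declared by the designer from the meaning of the data, not discovered from a sample, and the function name `candidate_keys` and the sample rows are hypothetical.

```python
from itertools import combinations


def candidate_keys(domain_names, rows):
    """Return each minimal set of domains whose values identify every row."""
    keys = []
    positions = range(len(domain_names))
    for size in range(1, len(domain_names) + 1):
        for combo in combinations(positions, size):
            if any(set(key) <= set(combo) for key in keys):
                continue  # contains a smaller key, so it is not minimal
            projected = {tuple(row[i] for i in combo) for row in rows}
            if len(projected) == len(rows):
                keys.append(combo)
    return [tuple(domain_names[i] for i in key) for key in keys]


# Hypothetical county table: the code alone, or name plus state, identifies a row.
counties = [
    ("001", "Arlington", "VA"),
    ("002", "Fairfax", "VA"),
    ("003", "Montgomery", "MD"),
    ("004", "Montgomery", "OH"),
]
```

Here either `("code",)` or `("name", "state")` could serve as the primary key; choosing between them is the arbitrary selection the text refers to.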

A relational model subschema is very concisely defined. It
need name only the relations and domains and indicate the
primary keys. The user is not concerned with ordering,
indexing, or access paths, so they need not be defined. In
addition, such aspects of the physical data can be altered
without impairing the applications using it.

From the user's point of view, and to a lesser extent the
implementor's, the major advantage of this approach is its
basic simplicity. It is not a system that has grown simply
in an attempt to meet user requirements, but an approach
from first principles with a rigorous mathematical basis in
relational calculus.

The relational calculus is powerful in its simplicity, and
its conciseness and clarity make it easy to amend. Programming
effort is reduced, particularly in updating, because
entire relations can be processed with one relational
calculus statement. It is well suited to query handling,
but it is not concerned with output formatting. Because
only relations and domains can be addressed, access
control problems are reduced. The relational calculus
is claimed to be better suited to optimization and to
augmentation with improved facilities than procedural
languages based on relational algebra.

By removing many decision-making responsibilities from
the user, the relational model imposes additional problems
upon the implementor.

The user cannot define network or hierarchical structures.
This does not mean that they cannot be used by the system
if they are the most efficient means of physical storage.
Relations in third normal form could be stored as serial
files. However, the number of extraneous fields would
produce a great deal of data duplication with possibly
unacceptable storage overheads. The problems of amending
such duplicated data have not been eliminated. Unlike the
CODASYL set structures, there is a wide choice of methods
of representing relations in physical storage. For example,
a relation can be stored by tuples or domains, or can exist
only as pointers from other relations. The ideal implementation
should be sufficiently flexible to provide the
structures best suited to the particular data and its
usage. If it is not, the database administrator will
need control over the physical storage structures used
for each type of relation.

The disadvantages of the relational model are not clear
at this time since there is a lack of practical experience
of commercial systems to draw upon, the notable exception
being the Honeywell Multics Relational Data Store. Statistics
Canada has developed a relational system called RAPID, with
which they are quite pleased, specifically for processing
their 1976 census. INGRES is a relational DBMS that has
been developed at the University of California, Berkeley.

Why are we so interested in the relational model? The answer
is simple: most DBMS's available today are designed to
optimize the retrieval of a large amount of information
from a small number of records. In statistical data
processing, most often what we need is a small amount of
information from a very large number of records.
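
One way to see why storage choice matters for statistical work is to compare a relation stored by tuples with the same relation stored by domains. The Python sketch below is an illustration only; the figures are made up, and `by_domains`, `total_from_tuples` and `total_from_domains` are hypothetical names rather than features of any system named in this paper.

```python
def by_domains(names, rows):
    """Transpose tuple-wise storage into one list per domain."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def total_from_tuples(rows, position):
    """Reads every whole tuple in order to use one element of each."""
    return sum(row[position] for row in rows)


def total_from_domains(columns, name):
    """Reads only the single domain that the statistic needs."""
    return sum(columns[name])


# Hypothetical relation with made-up values.
names = ("area", "year", "count")
rows = [("A", 1970, 10), ("B", 1970, 20), ("C", 1970, 30)]
```

Both functions return the same total, but the domain-wise form passes over only the one column of interest, which suits a request for a small amount of information from a very large number of records.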

2.2 Data Tabulation

Tabulation of data is an integral and inevitable part of
any statistical task. Whether the tables be created by
experienced programmers for large-scale censuses or by
subject matter analysts for studies involving small samples,
this task is complicated, tedious and repetitive. In most
cases, a generalized tabulating system simplifies the
effort and enhances the final product.



A generalized tabulating system is a series of parameter
driven computer programs designed to select, to restructure,
to cross-tabulate and to display statistical data. The
system is highly user-oriented through the utilization of
a nontechnical, nonprocedural, compact, English-like command
language that is easy to learn and easy to use. Users need
not have experience with conventional programming languages
in order to produce a wide variety of tables with minimal
programming effort.
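
The idea of a parameter driven tabulator can be sketched in Python. This is a toy illustration, not the command language of any system discussed here: the parameter names `select`, `rows` and `columns`, and the functions `tabulate` and `display`, are assumptions made for the example.

```python
from collections import Counter


def tabulate(records, params):
    """Select and cross-tabulate records as directed by a parameter dict."""
    keep = params.get("select", lambda record: True)
    row_var, col_var = params["rows"], params["columns"]
    cells = Counter()
    for record in records:
        if keep(record):
            cells[(record[row_var], record[col_var])] += 1
    return cells


def display(cells):
    """Lay the cell counts out as a tab-separated table with sorted labels."""
    row_labels = sorted({row for row, _ in cells})
    col_labels = sorted({col for _, col in cells})
    lines = ["\t".join([""] + col_labels)]
    for row in row_labels:
        counts = [str(cells[(row, col)]) for col in col_labels]
        lines.append("\t".join([row] + counts))
    return "\n".join(lines)
```

A different table is produced by changing the parameter dict alone, for example `{"rows": "sex", "columns": "tenure", "select": lambda r: r["age"] >= 18}`, with no change to the program.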

The four important components in determining the success of
a generalized tabulating system are its

1)   tabulating power,

2)   ease of use,

3)   environmental adaptability, and

4)   acceptance.

Tabulating power refers to the ability of a system to produce
tables as requested by its user. For example, the computational
and formatting ability, and the lucid and aesthetic
display capability are fundamental to this criterion. On
the other hand, the clarity of the documentation and the
design of the user language are central issues concerning
the system's ease of use. Environmental adaptability may
also play an important role in the decision of choosing
a tabulating system for installations which do not possess
large scale computers. Transportability, memory requirements,
and processing efficiency have effectively eliminated
many tabulating systems from being considered for adoption.
Other functional features of a tabulating system such as
statistical capabilities, linkage to data base management
systems and graphical display systems may also be critical.

Finally, a generalized tabulating system can have the power,
be easy to use and be adaptable to the environment, but then
it must be accepted by its potential users. In most cases,
this means a change from the practice of custom coding
complete programs to the coding of simple parameters.
Statistical and economic analysts like this because it
means that they can produce their analytical tables
independent of programmers. Programmers and programming managers
seem not to like a GTS because it stifles their creativity
and minimizes their independence in the statistical production
process. But, in order for a GTS to be effective, it must
be used; therefore, it must be accepted.

What is the state of the art in Data Tabulation systems?
Figure 4 shows a selected list of tabulation systems and
some of their characteristics. Figure 5 shows a selected
list of statistical packages with tabulation capabilities.
Much of this information comes from (FRAN 76).



Packages like SPSS, BMDP, DATA-TEXT, and SAS are all
accepted and widely used. They provide limited tabulation
capabilities in the sense of the number of cells that can
be tabulated in one data pass and in their data display
options. But they do provide the analyst with a broad
range of statistical routines.

Also of concern is the ability to tabulate large census
micro and macro data files, and to format the tabulation
ready for publication.

Several national statistical offices are active in the data
tabulation area. Statistics Canada is using four generalized
tabulating packages: CASPER, STATARE, STATPAK and TPL.
CASPER was developed in the late 1960's and caught on slowly,
but it still has limited use. CASPER has been largely
replaced by STATAPE with its expanded capabilities and
improved user language. STATPAK supplements STATAPE by
providing interface capabilities with Statistics Canada's
data base management system RAPID, mentioned in the preceding
section.

Statistics Canada estimates that 70% of all tables are
currently being produced using generalized tabulating systems.
This figure includes the tabulations for their 1976 Census
of Population and Housing.





The U.S. Bureau of Labor Statistics released their Table
Producing Language (TPL) in 1974. Today, this appears to
be the most widely used generalized tabulating system in
the world. It has been distributed to over 150 installations.
Recently introduced at Statistics Canada, TPL is already
gaining widespread usage.

i 

As do many such systems, TPL uses a codebook or data dictionary
to define data variables, their names and their descriptions.
This codebook is usually coded by a programmer familiar with
the data. It is then used by analysts or other programmers
for their table preparation. Data is then referenced by
data name, just as with DBMS's. This is a very important
feature for table generators, because it allows for data
independence and consistency between programs and programmers.
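
The role of a codebook can be illustrated with a short Python sketch. It does not reproduce TPL's codebook syntax; the layout, the field names and the helper `field` are hypothetical, chosen to show how referencing data by name isolates table requests from the record layout.

```python
# Hypothetical codebook: name -> (start column, end column, description).
CODEBOOK = {
    "state":  (0, 2, "State abbreviation"),
    "age":    (2, 5, "Age in years"),
    "income": (5, 11, "Annual income in dollars"),
}

# The same variables laid out differently in another (hypothetical) file.
ALTERNATE_CODEBOOK = {
    "age":   (0, 3, "Age in years"),
    "state": (3, 5, "State abbreviation"),
}


def field(record, name, codebook=CODEBOOK):
    """Fetch a data item by name; only the codebook knows where it is stored."""
    start, end, _description = codebook[name]
    return record[start:end].strip()
```

If the record layout changes, only the codebook entry is revised; every request written as `field(record, "age")` continues to work unchanged.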

Usage of TPL has increased to a level where today there are
over 3000 references each month at the NIH computer center.
It is now normal practice at BLS to perform all new tabulations
with TPL.

Sweden is using their TAB68; France their system called LEDA;
and Czechoslovakia, ISIS. In May 1977, the Census Bureau
released a generalized tabulating system called GTS1.

Although these table generators may be different in their
language, the machines they run on, and their internal
design, they possess one common thread - they are all working,
parameter driven generalized tabulating systems.



2.3 Data Presentation 



Data Presentation, using the computer, comes in many forms -
charts, graphs, photocomposition, microform, and publications.
The objective is the display of data in graphical or pictorial
form to help users of the data discern relevant patterns,
trends and relationships. Very few people who have used
good charts and graphs would argue with the proposition that
"A picture is worth a thousand data values."

This paraphrasing of the old adage not only reveals the power
of graphics, but the problem with graphics: for graphics
technology to be useful, there must be data values to be
displayed. This is best achieved by integrating graphics
and DBMS's, a goal which is much-discussed and little-achieved.
This integration theme will be further pursued in this and
the following section.

Understanding the state of the art in graphics requires
recognition of the dichotomy between graphics for data
analysis and graphics for publication. There are substantial
differences in quality, precision, and aesthetics of the
data presentation. At the level of preparing graphical
output, publication-quality graphics is more expensive and
time-consuming than is data-analysis graphics. Yet both
sorts of graphics are relevant to the use of Census Bureau
data.

In data analysis, the emphasis is on quick interactive
specification and production of scatter plots, empirical
and theoretical probability density functions and
cumulative probability functions, regression fits, and time
series. The purpose of the analysis is to aid both
understanding of the data's statistical phenomenon (type
of distribution, correlations) and the data's significance
and meaning (demographic trends, relation between various
social and economic indicators, etc.).
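
Two of the quantities such displays are built from, an empirical cumulative probability function and a regression fit, can be computed as in the Python sketch below. It covers only the numbers behind the plot, not the drawing, and the function names `empirical_cdf` and `least_squares_line` are hypothetical.

```python
def empirical_cdf(values):
    """Return (x, F(x)) pairs, where F(x) is the share of observations <= x."""
    n = len(values)
    return [(x, sum(v <= x for v in values) / n) for x in sorted(set(values))]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

The pairs returned by `empirical_cdf` are the steps of a cumulative plot, and the slope and intercept are what a regression line drawn over a scatter plot would use.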

The aesthetics of the data presentation are not overly
important. What is important is the provision of easy to
use, uncomplicated systems whose use can be quickly mastered
by analysts with little or no computer programming experience.
Ease of use includes integration of the system with a general
and powerful data base system, so that any and all data of
interest can be easily accessed.

A number of such systems exist. The success of the systems
is much less a function of the straightforward graphics
technology they use than of their integration of graphics
and data.

In publication of statistical data, the aesthetics, quality
and resolution of computer-generated images become very
important, even critical. Crude plots which might satisfy
and be useful to an analyst are unsatisfactory to many of
the end users of Census Bureau data. Decision-makers and
policy-makers in the public and private sectors who use
the data have neither time nor inclination to work with
anything but the best that can be offered.

The state of the art in data presentation is schizophrenic.
First, there is the data analysis - data publication dichotomy.
In addition, there is a broad gap between state of the
art and common practice: a gap broader than in most
technically evolving areas. On the one hand, there are
numerous examples of magnificent computer-generated charts
and graphics, many of them in full color. On the other hand,
there are precious few commercially available turn-key
systems. As a consequence, state of the art work is done
in but a few research labs, universities, and government
agencies.


There are several reasons. Doing graphics work requires the
integration of numerous hardware and software components,
more so than regular interactive computing. Major investments
in time and equipment are usually necessary. As discussed
in a later section, most graphics software is not especially
portable, so program sharing is difficult. Investment in
graphics is often treated as discretionary, so graphics
development has lagged areas, such as DBMS, seen as more
central or crucial to many organizations' goals.




For these various reasons, the state of the art in graphics
is rather diffuse, quite unlike the DBMS and tabulation
areas. The state of the art will be described from the
viewpoints of hardware technology and software/system
technology.


Graphics Hardware

Available hardware for interactive graphics (for data analysis
or preparation of publications) ranges from the $4000 direct-
view storage tube to the $100,000 high-performance line-drawing
or color raster display system. While even better price and
performance are desirable, and expected, what we have is quite
usable for the tasks at hand.

This is also true for graphics plotting devices. Small,
inexpensive pen plotters and electrostatic matrix printer/
plotters produce very usable plots for data analysis and for
proofing of some types of publication material. Costs are
often well under $10,000. High-quality proofing and final
output can be had using precision plotters or COM devices,
which cost from $150,000 to $300,000. It is possible, from
the hardware viewpoint, to produce complete camera-ready copy
of pages including charts, graphs, maps, text and tables.
Color-separated negatives can also be produced.






Graphics software technology has two major focuses: general-
purpose graphics subroutine packages, and the applications
which are built using the packages. The packages are of
two sorts: those whose exclusive or major emphasis is
plotting, such as DISSPLA (DISS) and CALCOMP (CALC), and
those whose major emphasis is interactive graphics, but
still with the possibility of producing hard-copy plots,
such as GPGS (CARU 77) and GCS (PUK 76). The distinction
between these two types of packages has already begun to
blur: most of the newer packages, while still identifiable
as one type or the other, also provide some (perhaps limited)
capabilities of the other sort. These packages will continue
to evolve, but they are already quite usable. Their basic
purposes are firstly to hide all details of the display
hardware from the programmer (much like a compiler hides a
computer's details), and secondly (in most but not all cases)
to allow production of complicated charts such as time series,
bar charts, and pie charts with just a few subroutine calls
to the package. The packages allow programs which draw simple
plots to be written and tested in a few hours or less. The
packages also allow simple interactive programs to be prepared
in days or at most weeks.
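
The two purposes named above, hiding the display hardware and
producing a whole chart from one call, can be sketched as
follows. This is a hedged illustration in Python only; the
class and function names are hypothetical and do not reproduce
the interface of DISSPLA, CALCOMP, GPGS, or GCS.

```python
class Device:
    """Hypothetical driver interface: the only hardware-specific part."""

    def line(self, x0, y0, x1, y1):
        raise NotImplementedError


class RecordingDevice(Device):
    """Stand-in 'display' that records line segments instead of drawing."""

    def __init__(self):
        self.segments = []

    def line(self, x0, y0, x1, y1):
        self.segments.append((x0, y0, x1, y1))


def bar_chart(device, values, bar_width=1.0, gap=0.5):
    """Draw one rectangle outline per value using only Device.line calls."""
    x = 0.0
    for height in values:
        corners = [(x, 0.0), (x, height),
                   (x + bar_width, height), (x + bar_width, 0.0)]
        # Close the rectangle by pairing each corner with the next one.
        for (x0, y0), (x1, y1) in zip(corners, corners[1:] + corners[:1]):
            device.line(x0, y0, x1, y1)
        x += bar_width + gap


plotter = RecordingDevice()
bar_chart(plotter, [3.0, 5.0, 2.0])
```

The application calls bar_chart once; moving to a different
terminal or plotter would mean supplying another Device
subclass, with no change to the chart routine.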


Unfortunately, little of the hardware and software technology
has been translated into turn-key systems which can be
put into use solving real problems without requiring
non-trivial investments in system integration and application
programming. There are few exceptions. The first is Tektronix
hardware and software, which can readily be used for some
types of data analysis. Costs are low, and there is a wide
user community. General Electric's Genigraphics system can
be used interactively to produce impressive color slides for
presentations, and could be modified to produce output
suitable for publications. It is a specially-programmed,
minicomputer-based, stand-alone system, which would be
difficult to integrate into an overall publication or data
analysis system, and costs in excess of $300,000. The final
exception is in the drafting and design area, but such
systems are not usable for analysis and publication of
Census type data.






Resources for Graphics

Simple graphics can be done with little investment in people
or equipment: $15,000 for a Tektronix terminal and hard-copy
unit tied to a large time-shared computer, plus a programmer
to work with the people who have the problems to be addressed.
Quite a bit of leverage can be had in a programmer-rich
environment, simply by having the "graphics programmer" train
other programmers. But if programmers are scarce or nonexistent,
little can be done with such equipment beyond the use of




canned packages for plotting small quantities (tens or
twenties) of data values.

Significant graphics, be it for data analysis or publication,
requires significant people and equipment resources. Most
major computer graphics installations have 5 to 10 staff
members, and equipment valued from $250,000 to over $1,000,000.
There are naturally a few exceptions to these staffing needs:
a few installations manage with two or three exceedingly
dedicated and self-disciplined people.



Reemphasizing what has been stated earlier, graphics is unlike
DBMS and Data Tabulation because it requires more integration
of system components, such as terminal hardware, data communi-
cations, plotters, systems software, and application programs.
Thus it can be expensive and technologically challenging,
especially when the graphics is further integrated with a
data base, as described in the following section.



3.0 Integration



Data Base Management Systems, Data Tabulation Systems, and
Data Presentation Systems are all useful in their own right.
But to develop user-oriented software for dealing with very
large statistical data bases such as the Bureau of the Census',
an integration of the systems is absolutely essential. Figure
6 shows the general sort of integration which is required.






Data is entered through the DBMS into the data base where
it can be edited and imputed. The Data Tabulation System
accesses the data through the DBMS, and stores the resulting
tabulations back into the data base. The tabulation results
can of course be immediately printed as tables for examina-
tion, and can be used as input to the Data Presentation
System for the preparation of charts and statistical maps.
The common user interface allows users of the total system to
deal with single, uniform sets of concepts, terminology, and
procedures for carrying out data tabulation and data presen-
tation.
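
The flow just described (entry through the DBMS, tabulation
stored back into the data base, presentation drawn from the
stored tabulation) can be sketched in a few lines. This is an
illustrative Python toy under stated assumptions: the DataBase
class, the function names, and the records are all
hypothetical, and a dictionary stands in for a real DBMS.

```python
from collections import Counter


class DataBase:
    """Toy stand-in for the DBMS: named collections behind one interface."""

    def __init__(self):
        self._store = {}

    def put(self, name, rows):
        self._store[name] = list(rows)

    def get(self, name):
        return list(self._store[name])


def tabulate(db, source, key, target):
    """Count records by `key`, then store the table back in the data base."""
    counts = Counter(row[key] for row in db.get(source))
    table = sorted(counts.items())
    db.put(target, table)
    return table


def present(db, table_name):
    """Render a stored tabulation as a crude text bar chart."""
    return [label.ljust(8) + "#" * count for label, count in db.get(table_name)]


db = DataBase()
db.put("persons", [{"state": "MD"}, {"state": "VA"}, {"state": "MD"}])
tabulate(db, "persons", "state", "persons_by_state")
chart = present(db, "persons_by_state")
```

Neither tabulate nor present touches the raw storage; both go
through the same get and put calls, which is the point of
placing the DBMS at the center.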

Each component of the integration is important. The DBMS -
Data Tabulation link allows all data being tabulated to be
represented, stored, and accessed in a uniform way. It is
not necessary to write special conversion programs for data
to be tabulated. The DBMS - Data Presentation link permits
serious graphical data analysis and chart and map presentation
to be done. Some such link is essential, because the volume
of data involved can be quite high. For instance, a county-
level choropleth map contains in excess of 3000 data values.
A ten-year trend chart of several monthly economic indicators
contains 120 data values per trend line. These data values
cannot realistically be manually entered into a data presen-
tation system. In an environment where the emphasis is on
ease of use and high volume use, it is unreasonable to require
or expect the writing of one-time special-purpose programs
to convert data from a DBMS or access method representation
to that needed by a particular plot package. This simply
consumes too much programmer resource, and interposes a
psychological and financial barrier between the data analyst
or publication producer and the computer data base. What
we need is a system with which the data analyst, publication
producer, and perhaps in some contexts, the decision-maker
can sit down at a terminal, specify any required tabulations,
and then interactively examine the data in tabular and
graphical form, with sufficient flexibility to allow experi-
mentation with the data presentation.
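
The data volumes quoted above follow from simple arithmetic,
worked here in Python. The count of five indicators is an
assumption made only to put a number on "several".

```python
MONTHS_PER_YEAR = 12
years = 10
values_per_trend_line = years * MONTHS_PER_YEAR  # 120, as stated in the text

indicators = 5  # assumption: "several" indicators taken to be five
values_per_trend_chart = indicators * values_per_trend_line

county_map_values = 3000  # the text says "in excess of 3000"
```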

The high-level model of Figure 6 can be further refined in
two directions - one for data analysis, the other for data
publication. Figure 7 shows an expanded data analysis
system. Data can be retrieved, tabulated, analyzed, and
presented in various ways. With the possible exception of
the data retrievals (which might be quite slow), all these
steps would be carried out interactively.

For publication work, the integration needs are actually
more complex, as shown in Figure 8. This figure reinforces
the centrality of the data base, and shows that a number of
subsystems (only some of which directly involve graphics) are
needed for total computerization of the publication (that is,
Data Presentation) process.



Existing Integrated Systems



A number of partially-integrated systems have been developed,
but we know of none that are completely integrated. In the
data analysis domain, University of California at Berkeley's
GEOQUEL/INGRES, IBM's GADS and Los Alamos Labs' oil-lease
system represent varying degrees of integration. GEOQUEL
(BERM 77) (Figure 9)*, a geographic information system, is
built upon the relational data base system INGRES (STON 76).
Maps and data about geographic areas defined by the maps are
stored in the relational data base, and can easily be displayed.

If "mapofusa" is a state-level map of the USA, then

     MAP mapofusa ON population

causes the map to be displayed with state population figures.
A statistical map of the USA, using density of printed symbols
to show population and car density, can be obtained with

     SHADE mapofusa WITH #persons AS "x", #autos AS "*"






Thus rather complex presentations can be obtained quite simply.
In addition, the underlying relational data base system allows
arbitrary retrieval and manipulation of data.
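
The idea behind shading by density of printed symbols can be
sketched as below. This is not GEOQUEL and does not claim to
reproduce how it works; it is a hypothetical Python routine,
with made-up figures, that turns each area's value into a
proportional run of a chosen symbol.

```python
def shade(regions, attribute, symbol, max_symbols=10):
    """Map each region's value to a run of `symbol`, scaled to the largest."""
    largest = max(row[attribute] for row in regions.values())
    shaded = {}
    for name, row in regions.items():
        count = round(max_symbols * row[attribute] / largest)
        shaded[name] = symbol * count
    return shaded


# Hypothetical figures, in millions; not taken from any census file.
states = {
    "A": {"persons": 20.0, "autos": 10.0},
    "B": {"persons": 10.0, "autos": 8.0},
    "C": {"persons": 2.0, "autos": 1.0},
}

by_persons = shade(states, "persons", "x")
by_autos = shade(states, "autos", "*")
```

A real mapping system would print the symbols inside each
area's boundary; here they are simply returned as strings.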

The GADS system developed by Robin Williams (CARL 74, WILL 74)
and colleagues at IBM's San Jose Research Lab integrates a
relational data base of small area census and local data
with an interactive color raster display (Figure 10).
The user gives data retrieval, processing, and display
commands, and quickly sees the results. The emphasis is
on geo-data, and on information display using maps.
Computer-generated data can, if desired, be superimposed
with a local map on the display console.

*Shaded areas on the figures represent implemented capabilities.


Work at Los Alamos Scientific Labs (Figure 11) by Phillips,
Siebert, and others (PHIL 77) is used to maintain a data
base (using the S2000 DBMS) of off-shore oil leases in the
Gulf of Mexico. Choropleth maps are created to show the
status of various lease plots. A high precision film
recorder is used to make color slides and prints.



Statistics Canada has two partially integrated systems.
The first one, which produces working tables, utilizes RAPID,
the relational data base system discussed earlier, with
STATPAK, the table generator system that works with the
data base. The second system does their photocomposition
without benefit of the data base, using the table generator
system CASPER with some custom coding to interface with a
Videocomp owned by a private contract firm.

The Bureau of Labor Statistics' system uses CINCOM's network
data base management system TOTAL with BLS's own generalized
tabulating TPL. Photocomposition is done using PCL, their
print control language within TPL. The resultant output
is phototypeset using the Government Printing Office's
Linotron. Graphics work at BLS consists primarily of the
production of trend charts using DISSPLA. BLS is working
toward a completely integrated system.

The Census Bureau has two partially-integrated systems
(Figure 14). The graphics systems are oriented toward data
publication but used also for some data analysis. There
are data presentation systems for dot, choropleth, and
statistical maps (JONE 77), and for bar (FREE 76), pie
(JOHN 76) and time-series or trend-line (SPAI 76) charts.
The time-series chart system is integrated with a special-
purpose DBMS for maintaining the time series (BUSC 76).

The second system, GTS (Generalized Tabulating System)
(GENE 77), tabulates sequential files according to retrieval
and processing requests. There is a flexible capability
for specifying the details of how the table is to be pre-
sented with a line printer.
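
What tabulating a sequential file according to a retrieval
request amounts to can be shown with a short sketch. It is
written in Python with hypothetical names and records, and it
makes no claim about how GTS itself is specified or built: one
pass over the records, a selection test, and fixed-width
output lines of the kind a line printer would produce.

```python
from collections import defaultdict


def crosstab(records, row_key, col_key, select=lambda r: True):
    """One sequential pass: count selected records by (row, column)."""
    cells = defaultdict(int)
    for record in records:
        if select(record):
            cells[(record[row_key], record[col_key])] += 1
    return dict(cells)


def print_lines(cells, width=8):
    """Format the counts as fixed-width lines, one per table row."""
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    lines = [" " * width + "".join(c.rjust(width) for c in cols)]
    for r in rows:
        counts = "".join(str(cells.get((r, c), 0)).rjust(width) for c in cols)
        lines.append(r.ljust(width) + counts)
    return lines


# Hypothetical records standing in for a sequential file.
records = [
    {"region": "North", "tenure": "own", "age": 40},
    {"region": "North", "tenure": "rent", "age": 30},
    {"region": "South", "tenure": "own", "age": 17},
    {"region": "South", "tenure": "own", "age": 52},
]
table = crosstab(records, "region", "tenure", select=lambda r: r["age"] >= 18)
report = print_lines(table)
```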

It is interesting to observe that none of these systems
comes close to achieving a full integration of data manage-
ment, data tabulation, and data presentation systems. The
highest degrees of integration are for small, limited-purpose
systems. There appear to be several reasons for this.




System Integration Problems

The most fundamental problem is the quandary presented by
the following two statements:

1) It is difficult to integrate existing systems
   which were not initially designed to be integrated.

2) It is expensive to develop new systems.


The net result is that existing, already-developed subsystems
are generally not directly usable in building an integrated
total system. Adaptation and modifications may be feasible,
and are preferred, for economic reasons, to starting completely
from scratch. In fact, however, the ease of use objective is
usually best met by starting a system design project without
a commitment to using or adapting existing software. This
allows the development of a conceptual whole with an integrity
of its own, unfettered by the need to compromise the design's
clarity (hence ease of use) for the sake of using existing
software.


System integration is also hampered by portability and standards
problems. Just the right graphics system might be available
on computer "A", but unless the programs can be moved to
computer "B", they are relatively useless.

We believe that it is possible to build an integrated system
to use a large statistical data base such as the Census
Bureau's. The subsystems are understood, some integration
has already been achieved, and the hardware, software and
systems technology is understood. What is needed is the
commitment and resources. It can be done!



System Delivery 

Once an integrated generalized information system has been
developed, the next issue to be addressed is that of delivery
of the system capabilities to the end user.

There are all sorts of users of Census data - large and
small, private and government, business and industry, academic
and commercial, with or without technical interest and with
or without computer and programmer resources. Therefore, we
must consider a broad spectrum of delivery possibilities when
we consider making our data available to end users. What are
the possibilities?

First, the data can be made available through human inter-
mediaries. This can be done by having some data users'
service organization whereby the user's particular request
for data can be satisfied.

The second possibility is the distribution of software with

The third and last possibility would be the establishment of
public-access data centers, whereby, through data communication
facilities, the users would have access to the data base
and the software tools to solve their data requirements.

But there are factors affecting system delivery whatever
approach is chosen. Hardware compatibility is the first
problem and is affected by such things as internal formats,
word sizes, and peripheral and ancillary equipment.

Another problem to be encountered is that of software
portability. Delivering software to operate on different
computer systems is quite a challenge. But we know that
this is possible by the example of numerous successful
models, including SPSS, BMD, S2000, MARK IV, IMSL, DISSPLA,
and PLOT-10. These systems have been successfully distri-
buted by overcoming two barriers: the technological one of
achieving CPU, language, and operating system independence,
and the managerial one of providing a disciplined system
for creating, updating, and disseminating system documentation,
fixes, and upgrades. Neither barrier is trivial, although
technologists tend to dwell on the former, leaving the latter
to chance.

The technological problems are perhaps a bit more complex
than those addressed by the distributors of the above models,
because we are concerned with integrated systems which require
diverse computer resources: large file systems, large main
memory, various graphics devices.




Many problems are resolved by using a standard programming
language, such as ANSI FORTRAN IV or ANSI COBOL. But some
operating system interface matters, and word size/precision
problems, still remain. They are generally resolvable by
programming so as to isolate the operating system or com-
puter dependencies to a few subroutines which are recoded
for each new environment.
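
The isolation technique reads as follows when sketched in
Python rather than FORTRAN or COBOL. The Environment class,
its two methods, and the planning routine are hypothetical
names chosen for the illustration; the point is only that the
portable routine reaches the host through a small, replaceable
layer.

```python
import os
import sys


class Environment:
    """The few 'subroutines' that would be recoded for each installation."""

    def scratch_path(self, name):
        # Operating-system dependency: where temporary files live.
        return os.path.join(os.curdir, name)

    def max_exact_integer(self):
        # Word-size dependency: largest integer the host handles natively.
        return sys.maxsize


def plan_passes(env, record_count, records_per_pass):
    """Portable logic: reaches the host only through Environment methods."""
    if record_count > env.max_exact_integer():
        raise OverflowError("record count exceeds host integer range")
    passes = -(-record_count // records_per_pass)  # ceiling division
    return [env.scratch_path("pass%d.tmp" % i) for i in range(passes)]


class SmallWordEnvironment(Environment):
    """Hypothetical 16-bit host: only the dependency layer changes."""

    def max_exact_integer(self):
        return 2 ** 15 - 1


files = plan_passes(Environment(), 2500, 1000)
```

Moving plan_passes to the hypothetical 16-bit host requires no
edit to the routine itself, only the substitution of
SmallWordEnvironment.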

If a program which is to be delivered requires a DBMS, there
are two choices: deliver the DBMS also, or interface to a
"standard" DBMS. Unhappily there are no standards for DBMS's,
although several commercial systems (such as MRI's S2000 and
CINCOM's TOTAL) have been implemented with different manu-
facturers' computer systems. The CODASYL report (CODA 71)
has had a major impact on DBMS's, and many DBMS's conform at
least to the spirit of the report's recommendations. Thus
there are a number of similar, but not equal, DBMS's in
existence. This is not enough for software distribution,
just as having ten or so FORTRAN dialects is not enough.

For passive output graphics, there are two dominant de facto
standards: the "CALCOMP routines" (CALC), and DISSPLA (DISS).
For interactive graphics, there are a number of widely-used
device-independent packages, such as GPGS (CARU 77), GCS
(PUK 76) and GINO-F (GINO). None has achieved preeminence.
There is also a proposed standard, developed by ACM/SIGGRAPH,
which may be officially adopted by ANSI (perhaps in modified
form) within the next few years (GSPC 77).



To summarize software portability, it is fair to say that
standard FORTRAN or COBOL programs which do processing
and simple I/O can be "ported" to new computers quite
easily. Programs requiring the services of DBMS's or
graphics packages are not nearly so easily moved.

The third problem is that of data portability. In sequential
summary tape form, data portability is very do-able since
there are no real technological problems with the existing
standards for tape format, labelling and coding. Standards
also exist for data communication; therefore, data transmission
poses no technological problems.


Data portability using DBMS structures is also do-able with
a portable DBMS. Today, we know of none that are not some-
what machine dependent.



These portability problems largely disappear if the public-
access multi-user service center approach is selected as
the delivery vehicle. Statistics Canada has an interesting,
unique approach to delivery of their economic time series
data base through a system called CANSIM. Through a joint
government/private enterprise venture, CANSIM is made available
through commercial time-sharing services throughout Canada,
the U.S. and even Europe. Statistics Canada maintains the
master data base at a "parent" time-sharing center in
Montreal. "Subscriber" time-sharing services contract
with Statistics Canada for $1500 a month for the right
to market the time series that they download each day
from the parent center. Subscribers are contractually
obligated to update data bases within twenty-four hours
of an update of the master file in the "parent" center.
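
The twenty-four hour obligation can be expressed as a simple
check. The sketch below is in Python with hypothetical
function names and timestamps; it is one reading of the rule
as stated here, not a description of CANSIM's own software.

```python
from datetime import datetime, timedelta

UPDATE_DEADLINE = timedelta(hours=24)  # from the contractual rule in the text


def is_compliant(master_updated, subscriber_updated, now):
    """True if the subscriber copy is current, or still inside the 24 hours."""
    if subscriber_updated >= master_updated:
        return True
    return now - master_updated <= UPDATE_DEADLINE


master = datetime(1977, 11, 8, 9, 0)      # hypothetical timestamps
stale_copy = datetime(1977, 11, 7, 9, 0)
```

A subscriber whose copy predates the master update is within
the rule for the first twenty-four hours and outside it
afterwards.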


Each time-sharing service offers the CANSIM software
supplied to it by Statistics Canada, as well as
any software that it has developed for its users.

Statistics Canada uses an AMDAHL 470 V/6 computer which is
plug-to-plug compatible with the IBM 370. Only software
developed for their machine is distributed. Subscribers
with machines outside the IBM 370 family must assume the
responsibility of converting the CANSIM software to their
environment.

This approach to delivery of data appears to work very well.
It is but one of many possibilities involving a public-access
system.

No matter how end users access the computer, there must be
good interactive access to the integrated system's capabilities.
It is crucial that these capabilities be easy to use. Other-
wise, they may not be used at all! We know that ease of use
is easy to talk about, but hard to achieve. Both the
conceptual system model, which the users must master,
as well as the details of the command language syntax,
error messages and prompts must be carefully designed.


To achieve systems that are easy to use requires careful
top down design and planning. We are beginning to know
how to do this (FOLE 74, CHER 76) but our skills are not
nearly perfected. What we do know is that making systems
easy to use is expensive of both people and computer time.
We know that redesigns of command languages are sometimes
necessary, and that use of general-purpose tools can make
the implementation and modification tasks faster and less
expensive. There is certainly the possibility of an easy
to use command language patterned after English, but in a
constrained form. Such systems are likely to be common
in the next decade.
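
A command language patterned after English, but in a
constrained form, might accept only sentences of one fixed
shape and answer anything else with a prompt showing that
shape. The Python sketch below assumes a purely hypothetical
grammar (verb, measure, "by", dimension, optional "as" form);
no existing system's syntax is implied.

```python
import re

# Hypothetical grammar: VERB MEASURE by DIMENSION [as FORM]
COMMAND = re.compile(
    r"^(tabulate|show)\s+(\w+)\s+by\s+(\w+)(?:\s+as\s+(table|chart|map))?$",
    re.IGNORECASE,
)


def parse_command(text):
    """Parse one constrained-English command, or explain what was expected."""
    match = COMMAND.match(text.strip())
    if match is None:
        raise ValueError(
            "Expected: tabulate|show MEASURE by DIMENSION [as table|chart|map]"
        )
    verb, measure, dimension, form = match.groups()
    return {
        "verb": verb.lower(),
        "measure": measure.lower(),
        "dimension": dimension.lower(),
        "form": (form or "table").lower(),  # default presentation
    }


request = parse_command("Show population by county as map")
```

Because the grammar is small, the error message can state the
whole of it, which is one way of keeping prompts and syntax
consistent with each other.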

At the moment, only the military, large vendors/users, and
a few research labs seem concerned about computer system
ease of use. We must be prepared to join in their concern,
study their systems, and learn their craft.






5.0 Summary

Systems have been developed for DBMS, GTS, Photocomposition,
and computer generated graphics. Today completely integrated
systems do not exist, but partially integrated special purpose
systems do exist and have demonstrated the technological
feasibility of developing a completely integrated generalized
information system. All it takes is resources, a management
commitment and direction. It can be done!




J ■ 



1 



4 



■ \ 



3 



■ a 



ERIC 



« * 



Host Language CODASYL

                  DMS1100           IDMS              Honeywell IDS-II

CPU               UNIVAC 1100       IBM 370           H6000

Item Description  COBOL Oriented    Host Language     COBOL Like
                                    Like

Logical           Network           Network           Network

Physical          Pointers          Pointers          Pointers

Access Methods    Direct            Direct            Direct
                  Hashed            Hashed            Hashed
                  ISAM              Network           ISAM
                  Network                             Network

D.B. Creation     User Programs     User Programs     Utility

Query Language    Yes               No                Yes

Report Generator  Yes               No                No

Host Language     COBOL                               COBOL
                  FORTRAN                             FORTRAN

Multi-thread      Yes

Security          None              Thru Subschema    Password

Data Validation   None              None              Yes

Recovery          Full-Scale        Full Scale        Full-Scale

Surveillance      Log Tapes and     None
                  Statistics
                  Collection


Host Language Non-CODASYL
Data Base Management Systems

                  Burroughs         Cincom            IBM               MRI               Software AG
                  DMS II            TOTAL             IMS               System 2000       ADABAS

CPU               Burroughs         IBM 370           IBM 370           IBM 370           IBM 370
                  6700, 7700        CDC 6000                            UNIVAC 1100
                                    UNIVAC 70                           CDC 6000
                                    PDP-11
                                    Honeywell 6000
                                    IBM System/3
                                    NCR Century
                                    Varian V70

Item Description  Host Language     Host Language     Host Language     Host Language     Host Language
                  Like              Like              Like              Like              Like

Logical           Network           Multi-list        Hierarchy         Tree Structured   Almost Relational

Physical          Pointer           Pointer           Adjacency         Adjacency         Pointers

Access Methods    Direct, Hashed    Direct                              Direct            Direct
                  ISAM, Bit vector  Sequential                          Sequential        Inverted Indices
                  Network           Hashed                              Inverted Indices  Hashed

D.B. Creation     User Programs     User Programs     User Programs     Utility &         Utility &
                                                                        User Programs     User Programs

Query Language    Yes               Yes               Yes               Yes               Yes

Report Generator  Yes               Yes               Yes               Yes               Yes

Host Language     ALGOL             Any lang. with    COBOL             COBOL             COBOL
                  PL/I              subroutine calls  Assembler         FORTRAN           FORTRAN
                  COBOL                                                                   PL/I
                                                                                          Assembler
                                                                                          ADASCRIPT

Multi-thread      Yes               Yes               Yes               Yes               Yes

Security          None                                Yes               Yes               Yes

Data Validation   Some              None              None              Some              Some

Recovery          Full-scale        Some              Yes               Full Scale        Full-scale

Surveillance      Some              None              Some              Log Tapes         Log Tapes

Figure 2




SELF-CONTAINED
Data Base Management Systems

                    Computer Corp.      Mead Technology     TRW
                    of America          Data/Central        GIM II
                    Model 204

CPU                 IBM 370             IBM 370             IBM 370
                                                            UNIVAC 1100
                                                            PDP-11

Item Description    Character String    Character String    Character String
                                                            Numeric

Logical             Almost Relational   Multi-list          Almost Relational

Physical            Pointers            Adjacency           Pointers

Access Methods      Sequential          Inverted Indices    Inverted Indices
                    Inverted Indices    Sequential
                    Hashed

D.B. Creation       Utility             Utility             Utility

Query Language      Yes                 Yes

Report Generator    No                  No

Host Language       COBOL, FORTRAN,     Any language with   COBOL and Own
                    PL/I, Assembler     subroutine call

Multi-thread        Yes                 Yes                 Yes

Security            Yes                 Yes                 Yes

Data Validation     Yes                 No                  Yes

Recovery            Some                Yes

Surveillance        Yes                 Yes                 Some

Figure 3




Data Tabulation Systems

Package:
  GTS-1; CO-CENTS; TPL; CROSS-TABS II; CENTS AID II

Organization:
  SSD-CENSUS; ISPC-CENSUS; BUREAU OF LABOR STATISTICS;
  CAMBRIDGE COMPUTER ASSOCIATES; DATA USE AND ACCESS LABS

Machine Availability:
  UNIVAC 1100; MANY; IBM 360/370 FAMILY; IBM 360/370 FAMILY

Minimum Resource Requirements:
  85K WORDS MEMORY, 2M WORDS DISK;
  IBM 370 - 24K BYTES;
  200K BYTES (NEEDS DISK SPACE FOR INTERMEDIATE RESULTS);
  80K BYTES;
  100K BY

Source Language:
  COBOL; COBOL; BAL

Cost:
  ANNUAL LEASE 16120, MONTHLY $600;
  PURCHASE FROM NTIS, $600 DOMESTIC / $1200 FOREIGN;
  ANNUAL MAINTENANCE $500

Comments:
  DEVELOPED FOR CENSUS INTERNAL USE.
    LANGUAGE - VERY POOR. EFFICIENCY - EXCELLENT.
  DEVELOPED FOR AID FOR INTERNATIONAL DISTRIBUTION.
    LANGUAGE - POOR. EFFICIENCY - EXCELLENT.
  PRESENTLY INSTALLED IN OVER 160 INSTALLATIONS WORLDWIDE.
    LANGUAGE - EXCELLENT. EFFICIENCY - GOOD.
  DEVELOPED FOR CROSSTABULATION OF SURVEY DATA.
    LANGUAGE - GOOD.
  SIMPLE SPECIFICATION OF TABULATION FROM LARGE COMPLEX FILES.
    LANGUAGE - VERY GOOD.

Figure 4




Statistical Packages with Tabulation Capabilities

Package:
  SPSS; BMDP; DATA-TEXT; SAS

Organization:
  SPSS, INC.;
  HEALTH SCIENCES COMPUTING FACILITY, SCHOOL OF MEDICINE, U.C.L.A.;
  DATA TEXT SYSTEMS;
  INSTITUTE OF STATISTICS, NORTH CAROLINA STATE

Machine Availability:
  MOST; MANY; IBM 360/370 FAMILY; IBM 360/370 FAMILY

Minimum Resource Requirements:
  IBM 370 - 150K BYTES; IBM 370 - 160K BYTES; 260K BYTES; 160K BYTES

Source Language:
  FORTRAN; FORTRAN; FORTRAN; PL/I / BAL

Cost:
  COMMERCIAL $6000 + $2000/YR OPTIONAL,
    NONPROFIT $1600 + $800/YR OPTIONAL,
    ACADEMIC $1000 + $600/YR OPTIONAL;
  $100;
  COMMERCIAL $1000 + $400/YR RENEWAL,
    NONPROFIT $760 + $300/YR RENEWAL,
    ACADEMIC $600 + $200/YR RENEWAL;
  COMMERCIAL $3600 + $1600/YR RENEWAL,
    NONPROFIT $2600 + $1600/YR RENEWAL,
    ACADEMIC $760 + $300/YR RENEWAL

Comments:
  COMPLETE PACKAGE FOR ANALYSIS OF SOCIAL SCIENCE DATA.
    LANGUAGE - GOOD.
  GENERAL STATISTICAL PACKAGE FOR TOTAL DATA ANALYSIS.
    LANGUAGE - VERY GOOD.
  STATISTICAL ANALYSIS AND DATA MANIPULATION OF SOCIAL
    SCIENCE DATA. LANGUAGE - GOOD.
  GENERAL STATISTICAL ANALYSIS - ESPECIALLY LIFE SCIENCES.
    LANGUAGE - GOOD.

Figure 5




INTEGRATED SYSTEM
[Block diagram. Legible component labels: Data Base, Data Entry/Edit,
Data Retrieval, Data Base Management System, Data Tabulation, Common
User Interface, Data Presentation.]

Figure 6



GENERAL INTEGRATED DATA ANALYSIS SYSTEM
[Block diagram. Legible component labels: Data Base, Data Tabulation,
Data Analysis and Computations, Data Presentation.]

Figure 7



GENERAL INTEGRATED DATA PUBLICATION SYSTEM
[Block diagram. Legible component labels: Data Base, Data Retrieval,
Data Tabulation, Data Presentation, Text Processing, Page Layout,
Publication Composer, Photocomposition.]

Figure 8



University of California - Berkeley
GEOQUEL/INGRES
INTEGRATED DATA ANALYSIS SYSTEM
[Block diagram. Legible component label: Data Analysis and
Computations.]

Figure 9



IBM-GADS
INTEGRATED DATA ANALYSIS SYSTEM
[Block diagram. Legible component labels: Data Presentation, Data Base
Files, Data Analysis and Computations.]

Figure 10



LASL Oil Lease
INTEGRATED DATA ANALYSIS
[Block diagram. Legible component labels: Data Presentation, Data
Tabulation, Data Analysis and Computations.]

Figure 11




STATISTICS CANADA
INTEGRATED DATA PUBLICATION SYSTEM
[Block diagram. Legible component labels: Data Presentation,
Tabulation, Text Processing, Layout, Publication Composer,
Photocomposition.]

Figure 12

Partial Shading = Partial Implementation



BLS
INTEGRATED DATA PUBLICATION SYSTEM
[Block diagram. Legible component labels: Data, Network, Data
Presentation, Text Processing, Page Layout, Publication Composer,
Photocomposition (Linotron) (Videocomp).]

Figure 13



CENSUS BUREAU
INTEGRATED DATA PUBLICATION SYSTEM
[Block diagram. Legible component labels: Tabulation, Data
Presentation, Text Processing, Page Layout, Publication Composer.]

Figure 14

Partial Shading = Partial Implementation



REFERENCES

(BERM 77)  Berman, R. and Stonebraker, M., "GEO-QUEL - A System for
           the Manipulation and Display of Geographic Data," SIGGRAPH
           '77 Proceedings, published as Computer Graphics 11, 2
           (Summer 1977), 186-191.

           Busch, R.A., TIMEBASE: Time-Series Data Base Management
           System, and TIMEBASE Processors User's Guide, Graphics
           Software Branch, Systems Software Division, U.S. Bureau of
           the Census, Washington, D.C., 1976.

(CALC)     CALCOMP Plot Package Reference Manual, CALCOMP, Inc.

(CARL 74)  Carlson, E., Bennett, J., Giddings, G., Mantey, P., "The
           Design and Evaluation of an Interactive Geo-data Analysis
           and Display System," Proceedings IFIP Congress 74,
           1057-1061, North Holland, 1974.

(CARU 77)  Caruthers, L., van der Bos, J., van Dam, A., "A
           Device-Independent General Purpose Graphic System for
           Standalone and Satellite Graphics," Proceedings of SIGGRAPH
           '77, published in Computer Graphics 11, 2 (Summer 1977),
           112-119.

(CHER 76)  Cheriton, D., "Man-Machine Interface Design for Timesharing
           Systems," Proceedings ACM 1976 Conference, 362-366.

(CODA 71)  CODASYL Data Base Task Group, April 1971 Report,
           Association for Computing Machinery, New York, N.Y., 1971.

(CODD 70)  Codd, E., "A Relational Model of Data for Large Shared Data
           Banks," CACM 13, 6 (June 1970), 377-387.

           DISSPLA, Integrated Software Systems Company, San Diego,
           California.

(FOLE 74)  Foley, J., and Wallace, V., "The Art of Natural Graphic
           Man-Machine Conversation," Proceedings IEEE 62, 4 (April
           1974), 462-470.

(FRAN 76)  Francis, I. et al., "Languages and Programs for Tabulating
           Data From Surveys," Proceedings of the Ninth Interface
           Conference on Computer Science and Statistics, 129-134,
           April 1976.

(FREE 76)  Freeman, J., BARCHART: A General-Purpose Plotting Program,
           Graphics Software Branch, Systems Software Division, Bureau
           of the Census, Washington, D.C., 1976.

(GENE 77)  Generalized Tabulating System - 1, Bureau of the Census,
           1977.

(GINO)     GINO-F Reference Manual, Computer-Aided Design Centre,
           Cambridge, England.

(GSPC 77)  Status Report of the Graphics Standards Planning Committee
           of ACM/SIGGRAPH. Published as Computer Graphics 11 (3),
           Fall 1977.

(HEIN 77)  Heindel, L. and Roberto, J., LANG-PAK - An Interactive
           Language System, American Elsevier, New York, N.Y., 1975.

(JOHN 76)  Johnson, J.A., PIECHART: A General Purpose Plotting
           Program, Graphics Software Branch, Systems Software
           Division, Bureau of the Census, Washington, D.C., 1976.

(JONE 77)  Jones, P.A., MAPS, Graphics Software Branch, Systems
           Software Division, Bureau of the Census, Washington, D.C.,
           1977.

(PALM 75)  Palmer, I., Data Base Systems: A Practical Reference,
           Q.E.D. Information Sciences, Inc., 1975.

(PHIL 77)  Phillips, R., "A Query Language for a Network Data Base
           with Graphical Entities," Proceedings of SIGGRAPH '77,
           published in Computer Graphics 11, 2 (Summer 1977),
           179-185.

(PUK 76)   Puk, R.F., The 3D Graphics Compatibility System, U.S. Army
           Corps of Engineers, Vicksburg, Miss., 1976.

(SPAI 76)  Spaid, G.L., TIMESERIES: A General Purpose Plotting
           Program, Graphics Software Branch, Systems Software
           Division, Bureau of the Census, Washington, D.C., 1976.

(STON 76)  Stonebraker, M., Wong, E., Held, G. and Kreps, P., "The
           Design and Implementation of INGRES," ACM Transactions on
           Data Base Systems 1, 3 (September 1976), 189-222.

(TAB 75)   Table Producing Language, Bureau of Labor Statistics, 1975.

(WILL 74)  Williams, R., "On the Application of Relational Data
           Structures in Computer Graphics," Proceedings IFIP Congress
           74, 723-736, North-Holland, 1974.






THE NEEDS FOR AND AVAILABILITY OF USER SOFTWARE TO
PROCESS AND ANALYZE CENSUS BUREAU MACHINE-READABLE PRODUCTS 1/

Warren G. Glimpse
Data User Services Division
U.S. Bureau of the Census

I. INTRODUCTION

A steadily increasing volume of data produced by the Census Bureau
is being made available to the public in machine-readable form. The user
demand for these products continues to grow at an even faster pace,
reflecting the high level of interest in microdata, more detailed summary
data, geographic reference data, and cross-reference and descriptor
type data being made available in machine-readable form. User demand
is further heightened by the growing sophistication of users in using
computers to analyze statistical data.

Ten years ago both the supply of and demand for Census Bureau
machine-readable products were quite limited, with only a few reels of
tape being distributed per year. Since 1971, however, more than 20,000
reels of tape have been sold, representing more than $1.2 million in
standard tape product sales. It is estimated that the total magnitude
of these tape products in the user domain, acquired through
intermediaries such as summary tape processing centers, is 8 to 10
times this volume - as many as 200,000 reels of tape. During the same
period, approximately $3.5 million in special tabulation projects have
been undertaken for the 1970 decennial census data alone. Most of these
customized products have been delivered to the sponsor, and other
interested users, in machine-readable form.

1/ Prepared for the 1977 Joint American Statistical Association/U.S.
Bureau of the Census Conference on Development of User Oriented
Software, sponsored by the National Science Foundation, November 8,
1977, Washington, D.C.

These trends are expected to continue due to increasing amounts of data
being made available from a larger number of statistical programs and a
growing number of users making use of machine-readable products.

There is little question that user accessible software plays an
important role in processing these machine-readable products for
administrative, planning, and decisionmaking purposes. This paper
provides an overview and perspective concerning the needs for and
availability of user accessible software and related issues involving
ways to improve access to and use of Census Bureau machine-readable
products. In this context, users are defined to be those persons
engaged in the process of acquiring and processing Census Bureau
machine-readable data. While in part this group includes some Census
Bureau staff, the larger universe of users are non-Census Bureau staff
located in Federal agencies, State and local government agencies,
colleges and universities, businesses, and professional and trade
associations, as well as individual researchers and others. User
accessible software includes computer programs which may be acquired by
users for use on their own computer as well as software which may be
accessed through terminals and time-sharing systems.

It is, however, important to stress that user software is only
one of the essential ingredients necessary to achieve effective and
efficient use of machine-readable products. Equally important issues,
which must be addressed concurrently, include the structure of the
files, technical documentation, user training, and manuals. Since the
demand for user software is derived from the need to process and
analyze machine-readable files, a summary of Census Bureau statistical
resources in machine-readable form accessible to users is first
reviewed in this paper. This summary includes an assessment of past
trends and current plans for developments involving production and
dissemination of Census Bureau machine-readable products.

Secondly, the need for user software, and related materials, to
facilitate access to and use of machine-readable products will be
considered. A review of existing Census Bureau software is presented.
An analysis of the unmet needs for user software is then considered.
This includes an assessment of the problems involved with access to and
use of existing Census Bureau machine-readable products with existing
software, including issues such as file structure, documentation,
universe comparability, cost-benefit issues, etc. Along with this,
plans and options for developing user software and other aids to assist
users in accessing and utilizing Census Bureau machine-readable
statistical resources are reviewed.

II. CENSUS BUREAU MACHINE-READABLE STATISTICAL RESOURCES

To set the stage for a discussion of what types of user software
are needed, existing and planned developments for Census Bureau
statistical resources in machine-readable form are first considered.
The reason for this is that software is developed to process available
files. To address the scope and structure of required software this
framework is essential.



From the present day and historical perspective, it is important
to differentiate between publicly distributable and internal,
confidential machine-readable products. Publicly distributable products
include those files which may be directly released to users outside the
Census Bureau. Examples of these files include the 1970 Census summary
tapes and public use samples, county business patterns files,
intercensal estimates files, geographic base files, and
machine-readable technical documentation. A more comprehensive list of
these products available for sale is contained in Appendix A.

Publicly distributable files include summary statistic, microdata, and
geographic and other reference files which are prepared for public
dissemination. To briefly review, summary statistic files are those
files containing data items which are aggregates or estimates of the
number of respondents with specified characteristics, measures of
activity levels, or the number of events occurring during a particular
period for specific geographic areas. The common feature of these files
is that each record contains an aggregate statistic for a variable
corresponding to a unique geographic area.

Microdata files are those files which contain data items
corresponding to characteristics of an individual respondent or
respondent unit. Each record generally corresponds to an individual,
household, or other type of basic survey unit. In some cases these
files contain ratio scale data (such as the neighborhood
characteristics 1970 Census public use sample) or family aggregates
derived from person records (such as the Current Population Survey
Annual Demographic File). The geographic area containing the response
is identified on each record provided the area is 250,000 population or
larger.

Geographic reference files contain descriptive data about selected
geographic segments, or areas. These files range in scope from the
Geographic Base File (GBF), a computerized representation of a map with
records corresponding to street and non-street segments, to the 1970
Census Master Enumeration District List, a hierarchical listing of
geographic areas and names for all geography larger than blocks.

These publicly distributable files are to be contrasted with the
confidential data files containing basic records with individual
identifying characteristics. Basic record files cannot be released to
the public in accordance with the Title 13 provisions to insure
confidentiality of individual information. However, basic record files
are the source of many quite valuable special tabulations. As a
consequence, consideration must be given to software which can be used
to prepare special tabulations on a timely and low cost basis.
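The relationship between record-level files and summary statistic
files described above can be illustrated with a minimal sketch in
Python. The record layout, field names, and area codes below are
hypothetical and are not taken from any Census Bureau file; the sketch
only shows how one aggregate per variable value and geographic area
might be derived from individual records.

```python
from collections import Counter

# Hypothetical record-level file: one record per person (assumed layout).
records = [
    {"area": "001", "age_group": "0-17"},
    {"area": "001", "age_group": "0-17"},
    {"area": "001", "age_group": "18-64"},
    {"area": "002", "age_group": "18-64"},
    {"area": "002", "age_group": "65+"},
]

def summarize(recs, area_key, var_key):
    """Build a summary statistic structure: counts by area and category."""
    counts = Counter((r[area_key], r[var_key]) for r in recs)
    summary = {}
    for (area, value), n in counts.items():
        summary.setdefault(area, {})[value] = n
    return summary

summary = summarize(records, "area", "age_group")
```

Each entry of `summary` plays the role of a summary statistic record: an
aggregate for a variable corresponding to a unique geographic area.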

In review of the existing files available to the public, two of the
most significant problems which are barriers to efficient and effective
use are file structure and technical documentation. While these issues
are discussed more extensively in a later section they should be
touched upon here for an appraisal of software needs.

With regard to file structure, it is important to note that few
structural and archiving standards have been set and followed from
statistical program to program. Secondly, by the very nature of
Census Bureau statistical files -- containing extensive data
corresponding to hierarchical subject matter and geography -- logical
records become quite long and are typically nested in a hierarchical
fashion. As a result, user software developed specifically for
machine-readable products from any given statistical program is
generally not transferable to other statistical program products.
In addition, due to long records and hierarchical file structures,
much of the conventionally designed software is difficult, or extremely
expensive, to use without modification. An additional complication to
the processing of many of the decennial files is introduced due to the
volume of data -- often precluding direct access methods and frequently
necessitating complicated or time consuming file extractions. There are
a variety of other file structural problems that frequently prevent
convenient usage such as location of sample response weights, geocodes,
record type codes, and others.
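To make the hierarchical file problem concrete, the following Python
sketch shows the kind of extraction step a user typically needs before
a conventional package that expects rectangular records can be applied.
The record type codes ("H" for household, "P" for person) and the
column positions are assumptions made for illustration only; actual
file layouts differ from product to product.

```python
def flatten(lines):
    """Turn nested household/person records into rectangular person rows.

    Assumed layout (hypothetical):
      H + 4-character household id + 3-character area code
      P + 3-digit age + 1-character sex
    """
    rows = []
    household = None
    for line in lines:
        rec_type = line[0]
        if rec_type == "H":
            # A household record opens a new group of person records.
            household = {"hh_id": line[1:5], "area": line[5:8]}
        elif rec_type == "P":
            if household is None:
                raise ValueError("person record before any household record")
            # Carry the household fields down onto each person row.
            rows.append({**household, "age": int(line[1:4]), "sex": line[4]})
        else:
            raise ValueError(f"unknown record type: {rec_type!r}")
    return rows

sample = ["H0001017", "P034M", "P031F", "H0002018", "P067F"]
rows = flatten(sample)
```

Every output row has the same fields, so it can be passed to software
designed for flat files, at the cost of repeating household data on each
person row.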

In the past, inadequate technical documentation and archiving have
also been a problem. Sometimes users have been unable to effectively
use or understand technical documentation. There have frequently been
many assumptions made about the user's knowledge of the file contents.
An additional problem has been the absence of a systematic approach to
the archiving procedures for existing files.

Some of these problems are solvable, some are not. Looking toward
the future, we envision a continued increase in the amount of
machine-readable products that will be made available to accommodate
more effective analysis. We are now taking steps to better identify,
document and resolve those problems that something can be done about --
as will be summarized in Section IV. What is important to note here is
that many of the problems associated with processing and analysis of
Census Bureau machine-readable products are not user software in and of
itself but a number of factors which create demands for special types
of software, which in a sense are artificial, as well as difficulties
in understanding how to use these files.

III. REVIEW OF EXISTING AND NEEDED USER SOFTWARE

To the extent that the Census Bureau produces machine-readable
products, the Bureau has an obligation to insure that users have an
opportunity to make effective use of the files. Consequently, certain
types of highly transferable software will be produced by the Bureau
where there are potential or existing inadequacies in software
otherwise available. In addition, as demands for data continue to
accelerate and volumes of data continue to increase, there is the need
to provide for the more timely dissemination of machine-readable data
and alternative forms of access and manipulation. A partial answer to
this may be a computer based, terminal oriented, public data
information system which will be discussed in Section IV.

There has been a great deal of emphasis placed upon the development
of table generating software both within and outside the Bureau to
process summary statistic and microdata files. In their most basic form
these software are oriented toward retrieval and display. Most Census
Bureau summary statistic files are prepared in a tabular structure
where the cells corresponding to the rows and columns are sequentially
listed in the record. Table oriented retrieval and display programs
with cross-referenced data descriptor files containing English language
identifiers have been in high demand. Tabular output from these
programs may be generated for user defined geographic area aggregates.
There are frequent requests for these types of software to process
files, such as county business patterns, simply as a means of
displaying specially aggregated data in a meaningful way.
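The retrieval and display function described above can be sketched in
Python as follows. The sketch assumes that each area's record holds the
table cells row by row, and that a separate descriptor supplies the row
and column labels. The area identifiers, labels, and cell values are
hypothetical, and the code is not a description of any particular
Bureau program.

```python
# Hypothetical descriptor: English language labels for one table.
ROW_LABELS = ["Male", "Female"]
COL_LABELS = ["Under 18", "18 to 64", "65 and over"]

# Hypothetical summary records: cells listed sequentially, row by row.
records = {
    "tract_1": [10, 20, 5, 12, 22, 8],
    "tract_2": [3, 7, 2, 4, 6, 1],
}

def aggregate_areas(recs, area_ids):
    """Sum the cell vectors of a user defined set of areas."""
    selected = [recs[a] for a in area_ids]
    return [sum(vals) for vals in zip(*selected)]

def cells_to_table(cells, n_rows, n_cols):
    """Reshape a sequential cell list into rows of a table."""
    if len(cells) != n_rows * n_cols:
        raise ValueError("cell count does not match table dimensions")
    return [cells[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def format_table(table, row_labels, col_labels):
    """Render the table with its descriptor labels as plain text."""
    header = " " * 8 + "".join(f"{c:>14}" for c in col_labels)
    lines = [header]
    for label, row in zip(row_labels, table):
        lines.append(f"{label:<8}" + "".join(f"{v:>14}" for v in row))
    return "\n".join(lines)

cells = aggregate_areas(records, ["tract_1", "tract_2"])
table = cells_to_table(cells, len(ROW_LABELS), len(COL_LABELS))
report = format_table(table, ROW_LABELS, COL_LABELS)
```

Keeping the labels in a descriptor separate from the data is what allows
one display routine to serve many tables.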



The two most notable examples of this type of software, as applied
to Census Bureau 1970 decennial products, are the DAUList program
series prepared by the Census Bureau to process the 1970 Census summary
tapes and the Data Use and Access Laboratories (DUALabs) '70 Series
Automated Census Analysis System. A variety of other retrieval and
display software are available that perform these functions, such as
the Bureau of Labor Statistics Table Producing Language and Informatics
MARK IV, although they are not specifically designed for Census Bureau
products.

The Bureau has also developed COCENTS, a more generalized table
generating program, capable of displaying summary statistic data and
developing estimates from microdata files and displaying these data
in tabular form. More recently, the Census Bureau has been developing
the General Tabulating System (GTS) in an attempt to further generalize
and extend COCENTS with the possibility of public access in mind. GTS
has been developed principally for internal use to provide a
generalized software system for preparation of tabulations and related
analyses for statistical reports and tape files. While this system is
currently in partial operation within the Bureau, it is not yet
generally publicly accessible.

There have been very limited efforts by the Bureau to develop more
analytically-oriented software which include functions associated with
modeling, parameter estimates, statistical tests, estimates and
projections of variables, etc. With respect to conventional statistical
methods for the analysis of relationships among variables, such as
analysis of variance, correlation, regression, contingency table
analysis, factor analysis, and other types of multivariate analysis,
there is a large amount of software in place to meet user needs.
However, there are some needs in this area. For example, there are no
generalized, transferable software available to prepare population
estimates using the Ratio-correlation and Component Method II
techniques employed by the Bureau.

An especially important type of analytical software needed for
processing Census Bureau products is that used to develop estimates and
multivariate tabulations from microdata files. The reasons for the
importance are many: varying types of hierarchical records from
microdata file to file, variations in the type of weighting scheme
employed to develop estimates or estimates of their standard error,
sheer magnitude of many files causing much of the conventionally
available software to be too inefficient and costly, and others. Of the
commonly available statistical packages, probably the Statistical
Package for the Social Sciences (SPSS) and Statistical Analysis System
(SAS) are used more with these files than other packages. Certainly,
there are other packages available to the user, such as OSIRIS, which
also support the basic tabulating techniques. As a result of the cost
to use general purpose analytical packages, more specialized microdata
file processing software have been developed. Two of the most often
used microdata file processing programs are CENTSAID, developed by
DUALabs, and COCENTS mentioned earlier.
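As an illustration of what developing estimates from a microdata file
involves, the following Python sketch produces a weighted two-way
tabulation in which each sample record contributes its weight rather
than a count of one. The variable names, the single `weight` field, and
the sample values are assumptions for illustration. Actual weighting
schemes vary from file to file, as noted above, and standard error
estimation is not shown.

```python
from collections import defaultdict

# Hypothetical sample records; "weight" is the number of population
# units each sample record is assumed to represent.
sample = [
    {"tenure": "owner",  "place": "urban", "weight": 100.0},
    {"tenure": "owner",  "place": "urban", "weight": 150.0},
    {"tenure": "renter", "place": "urban", "weight": 120.0},
    {"tenure": "owner",  "place": "rural", "weight": 80.0},
]

def weighted_crosstab(records, row_var, col_var, weight_var="weight"):
    """Estimate population totals for each (row, column) cell."""
    table = defaultdict(float)
    for rec in records:
        table[(rec[row_var], rec[col_var])] += rec[weight_var]
    return dict(table)

estimates = weighted_crosstab(sample, "tenure", "place")
```

A single pass over the file suffices, which matters when the file is
too large to hold in memory or to process by direct access.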

Important attributes of analytical software for processing summary
statistic files include the capability to develop aggregates, ratio
scale tabulations, and trends, and to compute standard errors (as
applicable). The structure of many decennial files imposes particular
problems due to their non-rectangular structure, suppression
indicators, or the numerator or denominator for ratio scales located on
different summary tape counts. For some files, such as the annual
population estimates or county business patterns, a trend analysis
problem is introduced as annual files are produced on a file separate
basis. While indeed SPSS, SAS, EASYTRIEVE, and other software packages
support the computational algorithms, they are very difficult to use in
many cases due to Census Bureau file configuration.
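The ratio scale problem mentioned above can be sketched in Python under
some simplifying assumptions: the numerator and denominator have already
been extracted from two separate files into dictionaries keyed by area
code, and a suppressed cell is represented by `None`. Both conventions,
and the sample values, are hypothetical.

```python
def ratio_by_area(numerators, denominators):
    """Compute numerator/denominator per area, honoring suppression.

    A ratio is reported as None when either value is suppressed (None),
    when the area is missing from the denominator file, or when the
    denominator is zero.
    """
    ratios = {}
    for area, num in numerators.items():
        den = denominators.get(area)
        if num is None or den is None or den == 0:
            ratios[area] = None
        else:
            ratios[area] = num / den
    return ratios

# Hypothetical values drawn from two different files.
aggregate_income = {"001": 500000, "002": None, "003": 90000}
households = {"001": 50, "002": 12, "003": 0}

mean_income = ratio_by_area(aggregate_income, households)
```

The same keyed-merge pattern applies to trend analysis, where each
year's values come from a separate annual file.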

The Census Bureau produces a great number of geographic related
machine-readable products -- extensively geocoded or geographic
reference files. Many applications involving these products have
required the development and use of a variety of geographic processing
software. Several programs have been prepared to develop and maintain
Geographic Base Files (GBF) which are computerized representations of
metropolitan maps. These programs, for the most part, have been
designed to permit their use outside the Bureau. This series of GBF
programs does not include analytical software. There have, however,
been efforts by the Bureau to develop software which could make GBF's
more useful to the user community.
Several distributable programs have been prepared to permit rather
specialized record-linkage, matching, and merging applications. ADMATCH
was the first distributable program of this type, originally developed
for use with the Address Coding Guide and the GBF to provide the
capability of geocoding computer readable records containing street
addresses. UNIMATCH was subsequently developed to provide a more
generalized record-linkage system by employing a user specified
matching algorithm. ZIPSTAN was developed as an auxiliary program
to work with UNIMATCH to prepare standardized street addresses and
add match keys. These types of programs are of principal importance
in matching records to a specific geographic segment, or area, which
can then be tabulated as a summary statistic for the area, or for
an aggregated set of areas. A significant user problem with these
specific programs is that they were programmed in IBM Assembler,
restricting their transferability.
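The general idea of address matching can be illustrated with a short
Python sketch: standardize a street address, then look it up against
reference segments that carry an address range and a geographic code.
This is a simplified illustration only and does not reproduce the
algorithms of ADMATCH, UNIMATCH, or ZIPSTAN. The segment table, the
suffix abbreviations, and the tract codes are hypothetical, and
refinements such as odd/even side of street are ignored.

```python
import re

# Assumed abbreviations used to standardize street type words.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "BOULEVARD": "BLVD"}

# Hypothetical reference segments: street name, address range, area code.
SEGMENTS = [
    {"street": "MAIN ST", "low": 100, "high": 198, "tract": "0001"},
    {"street": "MAIN ST", "low": 200, "high": 298, "tract": "0002"},
    {"street": "OAK AVE", "low": 1, "high": 99, "tract": "0001"},
]

def standardize(address):
    """Return (house number, standardized street name), or None."""
    tokens = re.sub(r"[^\w\s]", "", address.upper()).split()
    if not tokens or not tokens[0].isdigit():
        return None
    number = int(tokens[0])
    name = " ".join(SUFFIXES.get(t, t) for t in tokens[1:])
    return number, name

def geocode(address, segments=SEGMENTS):
    """Assign the area code of the segment whose range holds the address."""
    parsed = standardize(address)
    if parsed is None:
        return None
    number, name = parsed
    for seg in segments:
        if seg["street"] == name and seg["low"] <= number <= seg["high"]:
            return seg["tract"]
    return None
```

Once each record carries an area code, it can be tabulated into summary
statistics for that area or for any aggregate of areas.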

The Geographic-Related Information Display System (GRIDS) was
developed in the early 1970's by the Bureau to provide a fairly
general computer mapping capability to display data by geographic
area. Records processed by GRIDS contain data values to be mapped
and their corresponding x, y coordinates. GRIDS has been used
extensively with GBF's.
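One simple way to turn records of this kind into a density map is to
assign each record to a square grid cell by its coordinates and
accumulate the data values per cell. The Python sketch below shows that
approach under assumed inputs. It is not a description of how GRIDS
itself is implemented, and the record values and cell size are
hypothetical.

```python
def density_grid(records, cell_size):
    """Sum data values into square grid cells keyed by (column, row).

    Each record is an (x, y, value) tuple; cell_size is the side length
    of a grid cell in the same units as the coordinates.
    """
    grid = {}
    for x, y, value in records:
        cell = (int(x // cell_size), int(y // cell_size))
        grid[cell] = grid.get(cell, 0) + value
    return grid

# Hypothetical geocoded records: x, y coordinate and a value to map.
points = [(1.0, 1.0, 5), (1.5, 0.5, 3), (12.0, 3.0, 2)]
grid = density_grid(points, cell_size=10)
```

Each cell total can then be printed or shaded according to its value.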


In an effort to bring these programs together into one system,
the Comprehensive Manpower Planning Information System is being
developed by the Bureau for the Department of Health, Education and
Welfare. This system incorporates the use of ZIPSTAN, UNIMATCH, a
Geographic Base File, and an address-oriented data file to develop
an address-oriented data file with geocodes and x, y coordinates.
This file is then aggregated into a summary statistic file using
COCENTS, or may be processed by GRIDS to develop value and density
maps. The system also makes use of the DIME Area Centroid System
(DACS) to develop a boundary file from the GBF. The boundary file can
then be used with the summary statistic file by another program
(SCANMAP) to prepare value and density maps corresponding to data on
the summary statistic file.

Like those described above, most applications of Census Bureau
machine-readable data involve the use of other, non-Census Bureau
machine-readable data. This process almost always involves the use of
specially prepared programs to develop integrated files. In the most
typical application, parts or all of two or more files are merged to
develop a file which is then used in some type of analysis. This
particular process is frequently the most time consuming and expensive
phase. In more systematic approaches, data bases are developed from a
variety of sources and maintained over time, such as a health planning
data base.

During the past several years there has been rather considerable
interest in developing computer based information systems which make
use of Census Bureau data files, geographic reference files and
non-Census data to serve as a data base for continuing analysis of
socio-economic behavior in a particular area. These data systems have
been associated with computer software, usually of a unique nature,
designed to tabulate, display, and analyze trends. There are a great
number of such efforts that have been undertaken in the private and
public sectors. The Census Bureau provides a limited technical support
role in assisting Federal and State agencies to develop such systems
where they can be cost-effective and useful.

Data base management systems have generally not been utilized to
support Census Bureau machine-readable products outside the context
of computer based information systems. The reasons for this are many,
but the main ones are that they cannot be applied by the average user
in a cost-effective way and that these systems are oriented toward
transaction processing which does not normally apply to the typical
uses of Census Bureau products. We would not anticipate a change in
this situation. However, a data base management system might be very
effectively applied internally to the Bureau's processing which could
greatly facilitate the user's ability to access and use the data
through a system such as an interactive public data information system
which is discussed in the next section.

Of course many users have developed software to meet their own
needs for processing and analyzing Census Bureau and related files
that may be of use to others. In the past the Bureau has attempted
to promote a clearinghouse for the exchange of information concerning
such software, and in some cases supported the distribution of the
software itself. One example of such software distributed by the
Bureau is the choropleth mapping routine (C-MAP) developed at the
University of Idaho. This clearinghouse function has not been used
extensively by users; however, we plan to continue efforts along these
lines.

In summary, there are needs for specialized user software for
processing and analyzing Census Bureau machine-readable products. Most
of the existing user software now available to meet these needs has
been developed outside of the Census Bureau.



IV. UNMET NEEDS



A major problem in assessing unmet needs for user software is the
absence of objective data on this issue. Our impression of the needs
for software and problems involved in accessing and using
machine-readable data is based upon extensive contact between Bureau
staff and major machine-readable data users, our own staff's experience
in processing and analyzing these products both within and outside the
Bureau, and limited feedback from the more general user community. Too
often recommendations from this latter source do not prove useful, due
to failure to consider key problems such as volume, dominant types of
use, frequency of access, etc.

This section provides an appraisal of unmet needs for user software
and other user aids for processing machine-readable products based upon
the information we do have. It also outlines current activities and
plans that are underway to address these needs. The unmet need for
software is of two dimensions: distributable software and access to an
interactive time-sharing system. In addition, the apparently unmet needs
for convenient processing of machine-readable products include improved
technical documentation, file structure convention standards and
practices for archiving, user orientation and training, and other user
reference and technical aids.

Distributable User Software

There are at least two types of issues to be addressed for
distributable user software: transferability of the software from
system to system and the type of function served by the software. Other
issues that might be addressed include user convenience, costs for
acquisition and use, ease of modification, etc. Clearly, application
software described in this section might also be interactively accessed
depending upon demand and cost-effectiveness.

Despite the conventions for developing software as set forth by
the Federal Information Processing Standards (FIPS), software currently
available from the Bureau does not entirely conform with standards. As
a result, some of the software is not as transferable from system to
system as might be possible. An example of this problem is UNIMATCH,
which was programmed in IBM Assembler. Thus, one unmet need to be
addressed as additional software is developed is to conform to standards
for developing and documenting software that promote maximum
transferability. In cases where this may not be feasible, two versions
of a particular program or system could be developed.



Turning to the second issue, several unmet functional area needs
for distributable user software can be identified. As we look toward
the 1980 decennial program, we are now considering the preparation of
basic retrieval and display programs with increased capabilities over
those available for use with the 1970 decennial products. Based upon a
1976 survey of summary tape processing centers, 78 of the 96 responding
centers indicated that the Bureau should develop software for use with
the 1980 decennial files. Forty centers suggested that the software
should have improved capabilities over the 1970 DAUList programs. One
major improvement might be the development of more generalized table
generating software so that it could be used with any summary statistic
file produced by the Census Bureau provided the appropriate
machine-readable data descriptor file has been developed for the
particular data file. If we proceed ahead with the development of this
system it will be underway by early fiscal 1979 and may also be of use
with the 1977 economic census products.
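
The idea of table generating software driven by a machine-readable data
descriptor file can be illustrated with a minimal sketch. Everything
here is assumed for illustration only: the descriptor layout (field
name, starting column, width), the field names, and the sample records
are hypothetical and do not describe any actual Bureau file or program.

```python
from collections import defaultdict

# Hypothetical descriptor: field name -> (start column, width), 0-based.
# In practice this would be read from a machine-readable descriptor file.
DESCRIPTOR = {
    "state": (0, 2),
    "county": (2, 3),
    "population": (5, 8),
}


def parse_record(line, descriptor):
    """Slice one fixed-width record into named fields using the descriptor."""
    return {
        name: line[start:start + width].strip()
        for name, (start, width) in descriptor.items()
    }


def tabulate(lines, descriptor, group_field, value_field):
    """Sum value_field within each distinct value of group_field."""
    totals = defaultdict(int)
    for line in lines:
        record = parse_record(line, descriptor)
        totals[record[group_field]] += int(record[value_field])
    return dict(totals)


# Made-up sample records: 2-digit state, 3-digit county, 8-digit count.
SAMPLE = [
    "0100100024460",
    "0100300059382",
    "0200100000500",
]

if __name__ == "__main__":
    print(tabulate(SAMPLE, DESCRIPTOR, "state", "population"))
```

Because the record layout lives entirely in the descriptor, the same
two functions could in principle be pointed at a different summary file
by supplying a different descriptor, which is the kind of generality
the paragraph above has in mind.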

As described earlier, there is a great deal of software already
available for performing conventional statistical analysis. The need
in this area is for specialized analytical techniques, generalized to
meet the needs of a variety of users, such as market analyses or
processes for developing estimates and projections.

County Business Patterns can be used to demonstrate this need. The
County Business Patterns (CBP) data are the only source of
non-proprietary, annual, county-level data containing employment and
payroll characteristics of establishments at the 4-digit SIC level
available on a nationwide basis. Business firms seek to determine
geographic market concentrations, either as inputs for their production
processes or potential markets for their output. Typically these
analyses involve aggregation of employment, payroll, or value of
products for one or more 3- or 4-digit-level industries and the ranking
of the top few counties, SMSA's, or special market areas comprising
most of the market. As substantial variations in product mixes might be
analyzed by a given firm, manual analysis can become prohibitively
expensive, particularly for smaller firms. In the public sector, State
and regional planning and economic development agencies analyze
county-level economic activity, frequently requiring detailed industrial
data. Applications involve developmental planning to strengthen or
expand the existing economic base of an area as well as to provide site
location information to firms which might potentially locate within the
State. To meet these needs the development of an industrial analysis
program is being considered. In its most basic form, the program would
prepare a tabular and graphic display to analyze the top ranked areas
(e.g., counties) by employment as specified by the user or areas
containing a user specified percent of market as measured by employment
for any combination of SIC's.
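
The ranking logic described for the proposed industrial analysis
program might be sketched as follows. This is only an illustration
under stated assumptions: the function name, the record layout of
(area, SIC code, employment), and the sample figures are all
hypothetical and are not taken from County Business Patterns.

```python
from collections import defaultdict


def rank_areas(records, sic_codes, top_n=None, market_share=None):
    """Rank areas by total employment in the selected SIC codes.

    records      -- iterable of (area, sic, employment) tuples
    top_n        -- keep at most this many top-ranked areas, and/or
    market_share -- stop once cumulative employment reaches this
                    percentage (0-100) of the selected industries
    Returns a list of (area, employment, cumulative_percent).
    """
    wanted = set(sic_codes)
    totals = defaultdict(int)
    for area, sic, employment in records:
        if sic in wanted:
            totals[area] += employment

    grand_total = sum(totals.values())
    # Largest employment first; ties broken alphabetically by area name.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))

    result, running = [], 0
    for area, employment in ranked:
        if top_n is not None and len(result) >= top_n:
            break
        running += employment
        cumulative = 100.0 * running / grand_total
        result.append((area, employment, round(cumulative, 1)))
        if market_share is not None and cumulative >= market_share:
            break
    return result


# Made-up records for illustration only.
RECORDS = [
    ("County A", "2011", 500),
    ("County A", "2013", 300),
    ("County B", "2011", 150),
    ("County C", "2011", 50),
    ("County C", "5411", 900),
]

if __name__ == "__main__":
    for row in rank_areas(RECORDS, ["2011", "2013"], market_share=90):
        print(row)
```

Calling the function with `top_n` gives the "top ranked areas" display,
while `market_share` gives the "areas containing a user specified
percent of market" display; a graphic display would be a separate step.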

Similar analytical software is needed for other machine-readable
products. Another excellent application area would be the monthly
construction series C-40 housing building permit/authorizations data. At
present, however, the problem with these files is more basic: the
files are not properly structured, documented, or conveniently
available.



In addition, some consideration is being given to development of
general purpose estimation software. Emphasis is now focused upon
population characteristics although the needs exist in a variety of
other areas. Several population estimation programs exist internally but
are not as transferable, generalized, nor documented as the general user
community requires. Some efforts are underway to make some of the
population estimation software more available.



Interactive Time-Sharing Systems

Up to this point the focus of the paper has been on publicly
distributable software. Due both to the state-of-the-art of computer
hardware-software and to user needs which can best be served through
alternative methods of data processing, dissemination, and use, the
paper would be incomplete without considering the possibilities of an
interactive time-sharing system. To this end, the Bureau will be
undertaking a study during the next year to determine the feasibility of
implementing a computer based, terminal-oriented, public data
information system. In its fully developed state, this system would
afford access for all users to Census Bureau public use data through
their own terminals. The system, undergirded by extensive documentation,
training courses, and related user assistance, would provide a wide
range of retrieval and display, analytical, modeling, and other
capabilities to extend the usefulness of both the data base and the
system. Cost-effectiveness and the improvements that can be made in
extending data dissemination and use of Census Bureau products would be
the key considerations in determining whether or not to implement such a
system.



The system as envisioned would be in some respects similar to the
CANSIM Interactive System developed by Statistics Canada. CANSIM is an
interactive, on-line system which may be accessed by users of Statistics
Canada data. However, there would be rather considerable differences
in hardware, software, operating characteristics, and the scope and
size of data base supported.

Through an on-line system users would be provided considerably
improved access speed and convenience and support of an interactive
dialogue for problem solving and analysis. They would also have
greatly enhanced abilities to perform comparative (geographical or
distributional) analyses for quicker and less costly interpretation and
inference. Statistical estimates could be quickly developed from
microdata files. The need for some printed reports, or selected sections
of tabulations, might even be eliminated.

Additionally, interactive facilities would provide procedural
problem solving, preprogrammed self-help and tutorial aids, and other
user aids oriented toward inquiry-response such as subject content
indexing functions. A subject content/geographic data indexing system
could be maintained to assist users in locating required data. A
comprehensive bibliographic system could be maintained. A message system
might be established to keep users apprised of developments and problems
regarding Bureau products. Through computer assisted instruction, users
could obtain instructions to assist them in accessing, interpreting, or
using the data for a particular type of problem. Thus, this system could
provide not only improved data delivery but also a user education and
technical assistance function.



Of course, there are presently interactive systems making extensive
use of Census Bureau statistical resources. The most notable examples
have been developed by private service bureaus and universities. These
systems are both special and general purpose. However, for the more
generalized ones, existing public and private efforts in this area
exhibit both incompleteness and user inaccessibility. The more
generalized systems have not provided sufficient revenues to adequately
support and expand them in the private sector. As a consequence, only
selected subsets of data are maintained and technical services are not
readily available. The profit incentive forces distortion in equal
accessibility of such services to all potential users even in
quasi-public entities. Experience with interagency funding projects has
been less than satisfactory. Private and Federal foundation support has
provided for some research and development, but in general has not
established a basis for a continuing operation. These considerations are
some of the reasons leading the Census Bureau to consider the
implementation of such a system.



Technical Documentation and Archiving

As outlined in Section I, a major need relating to the usability
of machine-readable products is improved technical documentation and
standards and practices for archiving these products. In the case of
technical documentation, many files have been made available in the past
with little more than a record layout. This, of course, leaves many
unanswered questions for the user ranging from precise definitions of
subject content tabulations for specific fields to methods of estimating
summary statistics and their reliability from microdata files.



Steps are now being taken to improve technical documentation both
in terms of comprehensiveness of the documentation (record layout,
file structure, definitions, data file dictionaries, estimation
procedures, control count tallies, etc.) as well as standardization of
the formal technical documentation. We are moving toward more systematic
use of machine-readable technical documentation which corresponds to
standard conventions for naming and identifying fields, describing
universe edit processes, identifying and defining valid codes or ranges,
etc. During the past year machine-readable technical documentation has
been prepared for several publicly distributable products.

A second major problem has been the absence of standards and
practices for archiving machine-readable products. One result of this
has been the lack of a comprehensive inventory of machine-readable
products available. Files prepared as special tabulations for narrowly
defined uses are frequently not documented nor archived for subsequent
dissemination. A more substantive problem has been the lack of a
systematic approach to verifying the accuracy of data file contents and
then developing a master backup copy. While this has been done in part
for many of the major files, such as 1970 decennial summary tapes and
public use samples, products in lesser demand have not been given the
same attention. An additional problem in this area has also been a lack
of standards for file structure ranging from source of geocodes used to
field within record and record within file conventions.

We are now taking steps to improve archiving standards and practices.
During fiscal 1978, we are developing a manual outlining conventions for
developing and maintaining distributable data files. Procedures for
developing machine-readable file documentation are being standardized.
In addition, a tape file/computer software inventory is now being
developed which will be frequently updated. An inventory identifying
products prepared from special tabulations since 1970 has been developed
and will soon be made available to the user community.

User Reference Aids

Even when standards have been applied for developing files so that
they are processible by conventional software, and they are well
described in an inventory process and technical documentation and
readily accessible, many users lack the required knowledge to make
effective, or even correct, usage of the files. Indeed, the lack of user
reference aids which provide basic, or cookbook, approaches to the use
of these files creates a barrier, in some cases resulting in the user
lacking a desire to acquire the files or understand how they can be
utilized.

As a result, additional user aids need to be developed targeted
toward specific user groups or types of uses. These products may range
from something as basic as describing how to develop aggregates or ratio
scales from summary statistic files to methods of developing
multivariate frequencies from microdata files and analyzing cause and
effect or other types of relations between variables.

User Training and Orientation

With the increasing number of machine-readable products becoming
available, new developments in software, and increased interest in
these products by a larger number of users, it is evident that there
is also a need for increased promotion and marketing of available
products. In addition, more user training should be provided to
familiarize users not only with available files and their
characteristics but also with methods of making use of the files for
analysis.

In the past, training opportunities for users provided by the
Census Bureau have been restricted to learning what data products are
available, how to acquire them, and how to locate specific data
contained in them. More attention is now being given to how to use the
products. Courses are now planned on assisting users specifically
with the use of machine-readable products.



V. SUMMARY



The matter of primary importance that should now be further
discussed and analyzed is the general issue of how to improve the
accessibility and usability of Census Bureau machine-readable products.
To consider only the availability of and needs for user software, while
a critically important issue, focuses too narrowly on the larger issue.
As stated earlier, most difficulties associated with processing and
analysis of Census Bureau machine-readable products go beyond user
software to include documentation, file structure, user training,
reference materials and other user aids. These factors create demands
for special types of software which in a sense are artificial. In
addition, their absence sometimes leads to incorrect use of the files.
More extensive user dialogue on these issues is needed which can be
applied to make these files easier to use as well as more useful.

However, some needs for user software can be identified both in
terms of distributable user software and a more comprehensive system
for accessing Census Bureau statistical resources through an
interactive, terminal-oriented system.



Appendix A

MAJOR CENSUS BUREAU DISTRIBUTABLE
MACHINE-READABLE PRODUCTS

I. SUMMARY STATISTIC FILES

1970 Census of Population and Housing
    First Count
    Second Count
    Third Count
    Fourth Count
    Fifth Count
    Sixth Count
    PC(2) Subject Reports
    Population Centroids
    Adjusted County Data
    County Migration
    Special Tabulations
1972 Economic Census
    Manufacturers
    Governments
    Retail Trade
    Wholesale Trade
    Mineral Industries
    Selected Services
    Merchandise Line Sales
1969 Census of Agriculture
1974 Census of Agriculture
Revenue Sharing Population and Income Estimates
Federal-State Cooperative Program Estimates
County and City Data Book
County Business Patterns

II. MICRODATA FILES 



1970 Census of Population and Housing
    Public Use Samples
    Special Tabulations
1970 Census Employment Survey
1960 Census of Population and Housing
    Public Use Sample
Annual Housing Survey
Survey of Income and Education
Current Population Survey
    Annual Demographic File
    Special Tabulations
Survey of Purchasers and Ownership
Survey of Scientists and Engineers
Survey of Government Employment
Survey of Government Finances
Truck Inventory and Use Survey (1967, 1972)

III. GEOGRAPHIC AND OTHER REFERENCE FILES

1970 Census of Population and Housing
    Master Enumeration District List
    Address Coding Guide
    Urban Atlas-Tract Boundaries
    ZIP-Tract Cross Reference File
Geographic Base Files
School District Geographic Reference File
County Group Reference File
1972 Economic/Geographic Reference File
Area Measurement File
City Reference File
PICADAD
DIMECO
Spanish Surnames File




CENSUS SOFTWARE NEEDS OF STATE AND LOCAL GOVERNMENTS

HAROLD B. KING

THE URBAN INSTITUTE

WASHINGTON, D.C.

To address the software needs of state and local governments
in using census data, it is necessary to initially make some very
broad generalizations. The first of these is that the computing
capabilities of these organizations vary from very sophisticated
to non-existent. The second is that if we address the needs of
these organizations by focusing on the data they will be seeking
from the Census, we will be able to infer something about their
software needs. The third, and broadest, generalization is
that all of these organizations have similar software needs and
differ only in their computing capabilities to process data and
their levels of sophistication in analyzing it.

The last statement suggests that the proper approach in
assessing the needs of a governmental unit might best be based
on size rather than type. Studies have shown that there is a
very high correlation between the size of a governmental unit,
its computing capability and its analytical sophistication.

COMPUTING CAPABILITY

A report by the International City Management Association
(ICMA) states, "Although there has been considerable growth in
computing adoptions in cities, computer capacity is not very
great except in the largest cities (500,000 population and
over). The overall scale of EDP usage, which can be assessed by
examining the total number of operational applications in
cities, is directly related to city size."1 Based on the
1972 County and City Data Book, only 26 cities fell into the
500,000 and over category.

A similar report on counties states that large scale
computer facilities normally occur only in those counties with
populations over 250,000. In 1972 there were 150 counties with
populations greater than 250,000.

It would seem then that our major target user group would
be comprised of 50 state governments, 150 counties and 26
cities, or 226 governmental units. The 1972 Census of
Governments indicates that at that time there were 50 states, 3,044
counties and 35,408 municipalities and townships. Based on
population size alone, our major audience would thus be comprised
of only 0.59 percent of the total.
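
The percentage quoted above can be checked directly from the counts
given in the text. The short calculation below uses only those figures;
the variable names are introduced here for illustration.

```python
# Counts as stated in the text.
target = {"states": 50, "counties": 150, "cities": 26}
universe = {
    "states": 50,
    "counties": 3044,
    "municipalities_and_townships": 35408,
}

target_units = sum(target.values())    # 226 governmental units
all_units = sum(universe.values())     # 38,502 governmental units
share = 100 * target_units / all_units

print(f"{target_units} of {all_units} units = {share:.2f} percent")
```

The result agrees with the 0.59 percent stated in the paper.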

Another way of looking at the computer capabilities issue
would be to examine the types of EDP tasks performed by these
governmental units. James Danziger, in "Computers, Local
Governments and the Litany to EDP",4 develops a typology of
EDP tasks which one might find useful in describing the types of
processing performed by a local government. Of interest for
this discussion are two of these types: record re-structuring
and sophisticated analytics.



1. Record re-structuring. This type of task is related to
   the re-structuring and re-aggregating of records. It
   indicates a level of sophistication at a local government
   EDP operation which suggests they would be capable of
   reformatting census tapes and performing simple descriptive
   statistics on the file such as crosstabs, frequency counts
   and aggregations.

2. Sophisticated analytics. Danziger defines this as a
   type of activity which includes simulation studies,
   regression models and geo-coded data bases. In general,
   these applications utilize sophisticated mathematical
   methods or special technical capabilities of the
   computer to examine data.

In the above-cited ICMA studies, cities of 50,000 or more
responding to a survey indicated that record re-structuring
comprised only 6 percent of their total operational applications,
and sophisticated analytics comprised only 5 percent. The
results for counties of 100,000 and over were similar. Of those
counties responding, record re-structuring accounted for 7
percent of total applications and sophisticated analytics accounted
for 4 percent.

These survey results would suggest that the majority of
counties and other local governments have neither the computer
resources nor the analytical capabilities to develop software to
access complex census data files. This conclusion is further
supported by the results of a survey conducted by the Public
Policy Research Organization (PPRO) and reported in Nation's
Cities. The survey showed that most chief executives felt
the greatest problem associated with data processing in
their local governments was that data they needed for the
analysis of specific questions was not available to them. They
felt that the data they needed was being collected and stored,
but that their computer systems were not integrated enough
to present summary data to top management. Most local
government computer applications continue to serve only clerical and
information retrieval needs of individual departments and
agencies.

The picture at the state level seems to be much brighter.
The 1976-1977 Report on Information Systems Technology in State
Government6 identified 603 computers in use at the state level
in 1976. (Florida did not report in 1976 but had 20 machines
listed in 1975.) These machines ranged from some of the largest
machines commercially available to mini-computers, with almost
all of the major manufacturers represented. Of the 49 states
reporting in 1976, twenty-three reported having ten or more
computers. These varied from a high of 40 in New York State to
a low of one in Wyoming.

Computer applications were also varied. Uses ranged from
Driver Licensing to Resource Management. But, again, the
majority of the computer applications tended to serve clerical
and information retrieval needs of state departments and agencies.

What all this says is that there can be no one software
solution to meet the census data processing needs of states and
local governments.

Some units will have highly sophisticated processing and
analytical capabilities, and will be capable of developing their
own software.

Others will be more capable of using sophisticated software,
and would be more than happy to receive a fully tested and well
documented software package from the Census Bureau.

At the other end of the spectrum will be local governments
which will have no computing capabilities or will require
extremely simple software to generate descriptive statistics
from small area data available from the Census.

To meet this latter demand, The Urban Institute has developed
a simple multiple crosstab program. This program was
created to help local governments analyze survey data which they
had collected in order to evaluate governmental operations.
Although the package was well documented and the instructions
for using it were simple, we found it necessary to supply
technical assistance to the users. The Institute's experience
suggests that any census program established to meet the demand
for this type of software will have to be supported by a group
which will provide technical assistance for the installation and
use of such software.

Along similar lines, a conference was held at The Urban
Institute in 1971 titled "Workshop on Census and the Cities."
Its purpose was to determine the type of assistance needed by
local governments in processing the 1970 Census of Population
and Housing. In attendance were representatives from both large
and small local governments, consulting organizations, and
the U.S. Census Bureau.

The hypothesis around which the conference was formed was
that there was a lot of useful data in the 1970 Census, and that
a coordinated effort by a few foundations and non-profit
organizations could result in software products which would make this
data readily available to local governments. The concept was to
survey the data needs of local governments, and use the resulting
information to determine how best to meet those needs.

The survey was never performed because the general consensus
of the meeting attendees was that most local governments would
find it difficult to specify their data needs as they related to
census data. Instead, it was felt that a massive training
program would have to be mounted to inform potential local
government users about what was available and how it could be
used to answer questions and solve problems related to their own
governments.

The attendees agreed that such a program would be extremely
costly and would probably need a large government subsidy. To
the best of my knowledge, nothing further was done along these
lines in assisting local governments directly.

As a result, small local governments which attempted to use
machine readable census data found the going rough. Most of the
available foundation money was used to support software
development to meet the needs of universities, research organizations,
and large governmental units. Little was done to develop
capability at the local level outside of regional presentations
by the Census Bureau.

Since the Institute meeting, some work has been accomplished
in attempting to determine the data processing capabilities of
local governments. The ICMA and PPRO surveys mentioned earlier
have been part of this.

A general conclusion which can be made, then, is that the
majority of the local governments at this time are not able to
make use of census-developed software products. Even though
this is the case, all of these governments have a need for a
mechanism which will insure timely and easy access to this data.

CENSUS DATA NEEDS

One of the major reasons states and local governments need
timely access to census data is so they may evaluate it and
determine its accuracy. Many programs which make monies
available to these governmental units are based on head counts and
housing unit counts. If these numbers do not appear accurate,
the states and local governments will be pressing for recounts.

Another major need for data will be for redistricting
purposes. Many entrepreneurs had anticipated a heavy use of
computerized redistricting software in the 1970's. This did not
materialize because of the extremely political nature of this
process. The cost involved in the use of this software and its
related data bases also discouraged many from attempting it.

All states and many local governments will need social and
economic data to support applications for grants. Formula
grants in particular require the availability of accurate census
data. The formulas are based on such data as population, income
levels, and need, or a combination of these factors.

The fiscal 1976 formula for Title I of the Comprehensive
Employment and Training Act (CETA) of 1973 is a good example.
A three-part formula was used to determine the allocations: 50
percent was based on last fiscal year's allotment; 37.5 percent
was based on unemployment; and 12.5 percent was based on the
number of adults in low-income families in each prime sponsor
area.

In addition, a local government must have a population of
100,000 or more to be eligible for a grant. For governmental
units close to 100,000 population, accurate population statistics
will be most important (assuming they indicate a population
greater than 100,000).
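
To show why accurate counts matter under such a formula, the following
sketch applies the three weights and the 100,000 population threshold
described above. It is an illustration only: it assumes each part of
the formula is distributed in proportion to a sponsor's share of the
total among eligible sponsors, which the text does not spell out, and
the function name, field names, and sponsor figures are hypothetical.

```python
# Weights and threshold as stated in the text.
WEIGHTS = {
    "prior_allotment": 0.50,     # last fiscal year's allotment
    "unemployed": 0.375,         # unemployment
    "low_income_adults": 0.125,  # adults in low-income families
}
MIN_POPULATION = 100_000


def allocate(total_funds, sponsors):
    """Split total_funds among eligible sponsors.

    Assumption: each factor is shared in proportion to the sponsor's
    part of the eligible total for that factor (totals must be nonzero).
    """
    eligible = {
        name: data
        for name, data in sponsors.items()
        if data["population"] >= MIN_POPULATION
    }
    factor_totals = {
        factor: sum(data[factor] for data in eligible.values())
        for factor in WEIGHTS
    }
    return {
        name: total_funds * sum(
            weight * data[factor] / factor_totals[factor]
            for factor, weight in WEIGHTS.items()
        )
        for name, data in eligible.items()
    }


# Made-up sponsors; "Z" falls just under the population threshold.
SPONSORS = {
    "X": {"population": 250_000, "prior_allotment": 600,
          "unemployed": 3000, "low_income_adults": 4000},
    "Y": {"population": 120_000, "prior_allotment": 400,
          "unemployed": 1000, "low_income_adults": 4000},
    "Z": {"population": 99_000, "prior_allotment": 0,
          "unemployed": 900, "low_income_adults": 1500},
}

if __name__ == "__main__":
    for name, amount in allocate(1000.0, SPONSORS).items():
        print(name, round(amount, 2))
```

In this made-up case sponsor "Z" receives nothing because its counted
population is 1,000 short of the threshold, which is the situation the
paragraph above describes for units close to 100,000 population.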

There are a number of other grants besides CETA which use
formulas for determining eligibility and allocations. Some of
these are:

1. Community Development Block Grants

2. General Revenue Sharing

3. Special Food Service Program for Children

4. LEAA - Comprehensive Planning Grants

5. LEAA - Improving and Strengthening Law Enforcement and
   Criminal Justice

6. Industrial Development Grants

7. Urban Mass Transportation Capital and Operating
   Formula Grants

8. Highway Research, Planning and Construction

9. Low Income Housing Assistance Program

Data is also needed by state and local governments for such
programs as urban renewal, housing code enforcement, community
action, and resource allocation. Resource allocation includes
locating schools, fire stations, police stations and community
service centers.

Allocating educational resources requires information on
family incomes, status, children's age groupings, and adult
education levels.

In reviewing the needs' of states and local governments for 
<4 ♦ 
census data, a few issues stand out clearly. One of* these is 

> \ 
that much of the data produced by the Census Bureau is aggregated 

v 

to a unit (i.e., tract, block, block group, etc.) which does not 
relate to° local boundaries. 

» 

Local ^planning units, on the' other hand, need data available!^ 
at sgch lQ.cal data analysis levels as school districts, redevel- 
opment areas, congressional districts and traffic areas* The 

* 

availability of geo-coding and addres^ ma|ching schemes have 
aided in the use of census data, but tfyey are still expensive 
methods for solving problems-* 

A more organized approach to determining data needs of
local governments might be arrived at by identifying:

1. Departments which may be major statistical data
users

2. Functions for which data are needed

3. Data analysis areas

4. Uses for data

5. Data types

The following lists are not exhaustive, but they are a good
indication of the broad areas of statistical data needs which
many state and local governments have.




Departments

General Administration
Personnel
Planning
Utilities
Police
Inspections
Public Works
Welfare
Manpower
Finance
Budget
Housing and Rehabilitation
Education
Fire
Zoning
Traffic and Transportation
Health



Functions

Industrial Development
Health
Public Safety
Employment
Education
Land Use
Urban Redevelopment
Migration
Public Works and City Engineering
Welfare
Recreation
Transportation
Commercial Development
Urban Planning
Neighborhood Development
Housing



Data Analysis Areas

States
Counties
Municipalities
Townships
School Districts
Fire Districts
Police Districts
Traffic Districts
Wards
Streets
Street Segments
Blocks
Block Sides
Households
Census Tracts
Regional Planning Districts
Redevelopment Areas
Standard Metropolitan Statistical Areas
Soil Conservation Districts
Flood Control Districts
Census Enumeration Districts



Uses for Data

Plan New Facilities
Plan New Programs
Estimate Size of Clientele
Estimate Needs of Clientele
Anticipate Staff Needs
Policy Evaluation
Program and Project Evaluation
Support Project Proposals
Continuing Research
Support Grant Applications



Data Types

Voting Records
Welfare Records
Police Records
Marriage Records
Birth/Death Records
Individual Case Histories
Union Records
School Census Data
Employment Statistics
Income Statistics
Hospital Records
Traffic Data
Migration Data
Street Location/Numbers
Family Characteristics
Land Use Data
Insurance Data
Population Densities
Population Projections
Housing Characteristics
Utilities Data
Tax Records
Federal Reserve Data
City Engineering Records
Land Values
Fire Records
Housing Market Data
Air Pollution Data



Whether a governmental unit uses any or all of these depends
somewhat on size and authority. For example, only about 14 of
the largest 43 cities in the United States operate welfare
departments, as this is predominantly a county function.

The Census Bureau, in an attempt to clarify user needs, held
a series of open public meetings. These meetings were sponsored
and organized at the local level, and conducted with joint
participation of local persons and Census Bureau staff members. Held
between October 1974 and 1975, the meetings were conducted
in 73 cities covering all 50 states and the District of Columbia,
with over 6,000 local participants. In a "Synthesis of Local
Public Meetings," the Bureau presented an eleven-page
description of data items and their tabulations which were
compiled from these meetings.

The same type of meetings were held with state agencies.
There were 16 regional meetings of this type, and all but two
states (Arkansas and Colorado) had representatives at these
meetings. In a "State Agency Meetings Synthesis," the Bureau
again compiled an eleven-page description of data items and
tabulations suggested by the participants of these meetings.

The listings from both of these reports are too numerous to
be duplicated here, but they do support the hypothesis that the
data needs of states and local governments are similar.

COMPUTER SOFTWARE NEEDS 

The discussion so far has pointed out the wide range of
computing capability from the largest state to the smallest
local government. It has also presented an abbreviated description
of data needs. With these computing capabilities and data
needs in mind, we can now turn to computer software needs.

It is difficult to describe computer software needs of
states and local governments under the categories assigned to
this conference: Data Organization; Data Tabulation; and Data
Presentation. In many cases an item could easily fall into two
or three categories. There are also needs which do not fall
into any of these categories. With this caveat, an attempt will
be made at categorizing these overlapping needs. Items which
seem not to fit any category will be listed under a category
titled "General".



Organization 

One of the most significant needs of local governments is
to have data organized by geographical and political boundaries
which are more meaningful to them. This could include files
organized by school district, ward, traffic district, congressional
district or any of the data analysis areas listed earlier.
A special school district file was created from the 1970
census. More of this type of special tabulation should be made
available.

It would also be helpful if tapes were made available by
subject area such as economics, transportation, and health. A
number of meetings were held during preparation for the 1970
census at which this idea was proposed. Nothing was done about
it then. It is worth reiterating.

There is a need to supply more income data and have it
disaggregated into various subcategories such as government
income transfer programs (i.e., how much of a family's income is
comprised of housing support payments, aid to families with
dependent children and food stamps).

Along this line, there is a need for finer category breakdowns
in other areas. Some of these categories should also be
extended. A good example is age. The category "65 and over"
isn't very helpful for planners working on problems of the aged.

Another useful addition to census data files would be the
categorization of data on local governments by size. Many other
resource materials present local government data by population
size (i.e., 500,000 and over, 250,000 - 499,999, etc.). This
would aid those interested in comparing data from various
sources.

One item which would be most useful to small local governments
would be the addition of means, medians, and standard
deviations for most major items by geo-political and census
areas. This would help those small local governments that do
not produce these simple statistics. This data, accompanied by
some descriptive documentation explaining the meaning and
value of the statistics and some examples of their use, would be
very helpful.

Another item to be considered is the development of tools
and techniques that would allow users to compare data items over
time for areas whose boundaries continue to change. In this
same category is the need to have county groups not cross state
lines. County groups that cross state lines make it extremely
difficult to aggregate county group data to the state level in
order to increase sample size when working with public use
sample tapes.

Tabulation

If the Census Bureau decides not to prepare files organized
by geo-political boundaries and/or subject areas meaningful to
local governments, then it should be prepared to supply special
tabulations to meet these needs. These special tabulations
should be inexpensive to obtain and should be available in a
timely manner. When these special tabulations are prepared,
there should be a mechanism available by which potential users
could learn of their existence.

One type of software which would be useful to state agencies
and some local governments would be a program which would assist
them in making inter-censal projections using census supplied
data or locally generated data. The decennial census data is
almost out of date when local governments get access to it. As
stated earlier, only a few of these governments have the
capability to write their own estimation software. Some, and
perhaps most, aren't even aware of the techniques available to
perform these calculations.

Another type of software which would be useful would be
programs which would assist local governments in studying
transportation patterns and migration patterns. This software
would be extremely helpful if it could produce maps and symbols
on printers which could be easily understood by local government
personnel concerned with these problems.

The local public meetings and state agency meetings organized
by the Census Bureau identified a number of special tabulations
which these governmental units would like to have produced.
Most of these tabulations could become available at a reasonable
cost by a restructuring of the files. Alternatively, the federal
government might subsidize the machine costs which would result
from processing the files as they were structured in 1970.

Presentation

It is not clear that additional software to support data
presentation would be useful for the majority of state and local
governments if an output display device other than a printer
were required. The surveys identified in this paper indicate
that few governmental units have graphic terminals or plotters.
There seems to be sufficient software currently available to
produce tables on printers in a variety of formats.

As an example of the limitations on graphics capability, of
the forty states reporting on peripheral equipment in the
National Association of State Information Systems survey cited
earlier, only three listed the availability of graphics terminals.
Although thirty-one listed plotters, the majority of these were
located in highway or transportation departments.

What might be useful would be the availability to local
governments of printouts of meaningful data on their areas.
This would not necessitate the development of new software, but
might require the establishment of a user service subgroup at
the Census Bureau to respond to local government requests.

As mentioned earlier, the development of a simple multiple
crosstab program by The Urban Institute was useful to some local
governments. This type of software development, aimed at the
small local governments that are not involved in sophisticated
analysis, could be very useful. Most of the tabulation software
currently available requires a level of expertise not available
in these governments.

General

The first thing that is obvious from the inventories of
state and local government computers is the need to develop
software which is machine independent. If there is a desire to
support the full size range of systems, it will also be necessary
to develop software which can be run on a system with a small
amount of core and few peripheral input and output devices.

There also seems to be a trend towards the use of
minicomputers. Many local governments that found EDP costs to be a
limiting factor in their acquisition of a computer are now
rethinking the issue. The National Association of State Information
Systems report shows an increasing trend on the part of
state governments in acquiring minicomputers. This is an area
that the Census Bureau should explore as a means for increasing
access to their machine readable products.

Since the Census Bureau is releasing more files with a
hierarchical structure (Current Population Survey, Decennial
Public Use Sample, etc.), they should develop software which
would facilitate the use of these files. Also, primary records
on these files should indicate the number of sub-records
following when the number of these sub-records is variable.
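
The value of a sub-record count on the primary record can be seen in a small sketch. The record layout below is invented for illustration (record type in column 1, a two-digit sub-record count in columns 2-3); it is not the layout of any actual Census Bureau file, and the function name is hypothetical.

```python
from itertools import islice

# Hypothetical layout, assumed for this sketch only:
#   column 1     record type ("H" = primary/household, "P" = sub-record/person)
#   columns 2-3  on "H" records, the number of "P" records that follow


def read_hierarchical(lines):
    """Group a flat stream of records into (primary, [sub-records]) pairs."""
    stream = iter(lines)
    groups = []
    for primary in stream:
        expected = int(primary[1:3])
        subs = list(islice(stream, expected))  # read exactly that many
        if len(subs) != expected:
            raise ValueError("file ends before all sub-records were read")
        groups.append((primary, subs))
    return groups


# Made-up records, for illustration only.
sample_file = [
    "H02 tract=0101",
    "P   age=34",
    "P   age=36",
    "H00 tract=0102",
    "H01 tract=0103",
    "P   age=70",
]
households = read_hierarchical(sample_file)
```

With the count available, the reading program never has to look ahead or test record types to find where one household ends and the next begins, which keeps the software simple enough for a small installation.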

Data tapes should be cleaned and edited prior to their
release. "Dirty" tapes could be released when access to the
data is needed before cleaning and editing are completed.
These tapes should be replaced when the clean versions become
available. A program should be established to alert all
users to new errors as they are detected. Data tapes should be
treated as a planned product of the Census Bureau rather than as
by-products of other functions.

Any software prepared for use by states and local governments
should be available when the data tapes become available.
If this is not done, those governmental units wishing to analyze
the data will again develop their own software if they have the
capability. The other local governments will have to find other
means for solving data problems.

Finally, the Bureau might consider putting their data files
on-line and charging a reasonable fee for access. This has been
done by other organizations (an example is the 1970 Decennial
Public Use Sample file on the ACCESS system at the Massachusetts
Institute of Technology). If the software supporting these
files made retrieval and analysis simple for the unsophisticated
user, it would go far towards solving the data needs of
states and local governments.

Although most states seem to be capable of using census
developed software, it appears that the majority of local
governments do not have the equipment or personnel to avail
themselves of these proposed products.

For those that do, there is always the problem of
transferability. Danziger states, in reference to software
transferability, that a "striking finding when particular local
governments are examined is that successful examples of technology
transfer are rare". This view is also supported by The Urban
Information Systems Inter-Agency Committee's (USAC) experience.
Of the millions of dollars of software developed through the
USAC program, only a relatively few software packages were
adopted by other municipalities. Conversely, the GBF/DIME
package seems to have found wide acceptance by those local
governments capable of handling that particular software package.

Although all states and most local governments have the
need for more ready access to census produced data, only a
relatively small number will be able or willing to use census
produced software. This may be largely attributable to the
insufficient knowledge at the local government level about how
to use census data effectively. A well planned training program
aimed at these governments might well raise the level of
knowledge, and help to create an environment in which census produced
software could be more effectively utilized.

FOOTNOTES 



Kraemer, K. L., Dutton, W. H., and Matthews, J. R., "Municipal
Computers: Growth, Usage, and Management," Urban
Data Service Reports, Vol. 7 No. 11 (Washington, D.C.:
International City Management Association, November
1975).

Matthews, J. R., Dutton, W. H., Kraemer, K. L., "County
Computers: Growth, Usage, and Management," Urban Data
Service Reports, Vol. 8 No. 2 (Washington, D.C.:
International City Management Association, February 1976).

U.S. Bureau of the Census, "Census of Governments, 1972,"
Vol. 1, Governmental Organization, G.P.O., Washington,
D.C., 1973.

Danziger, J., "Computers, Local Government and the Litany to
EDP," Irvine, California: University of California,
Public Policy Research Organization, 1975.

"Chief Executives, Local Government and Computers", a
special report in Nation's Cities, Vol. 13 No. 10 (pp.
17-40), October 1975.

National Association for State Information Systems,
"Information Systems Technology in State Government," NASIS,
Lexington, Kentucky, 1977.

Gueron, J., Ouyang, B., "UI-MCTAB, A Multiple Crosstab
Program," The Urban Institute, Washington, D.C., 1974.

"Synthesis of Local Public Meetings," a report by the U.S.
Bureau of the Census, March 1977.

"State Agency Meetings Synthesis," a report by the U.S.
Bureau of the Census, September 1976.

op. cit. Danziger, J.



BUSINESS USE OF CENSUS DATA

Richard B. Ellis

Marketing Manager - Information

American Telephone & Telegraph Company

APPLICATIONS

Although the Bell System and its parent company, the American Telephone
& Telegraph Company, are only a small portion of the vast and complex
American business community, their use of census data is quite varied
and, hopefully, will cover a majority of the applications generally used
in business today. The Bell System's use of census data falls into
three broad categories:

1) Provision of Products and Services. Many of Bell's basic
products and services are currently furnished under
regulated franchise which carries with it the obligation
to have available what the customer wants when he wants
it at a reasonable cost. Since relatively long lead
times are required to manufacture and install some of the
equipment to permit this, detailed demographic trends and
forecasts are required for the thousands of areas we
serve to predict with as much accuracy as possible
future populations and their communications needs. This
involves such elements as population size and make-up,
migration trends, business development, household
formations and constituencies, etc.

2) Marketing and Corporate Management. For discretionary
communication products and services, Bell is in direct
and indirect competition with many other suppliers and
consumer goods. Here, the business objective is to
optimise its product line, distribution channels and
market position. Although individual market studies are
often the source of the basic data, extrapolation of
these findings into generalized forecasts, predictions
and strategies is heavily dependent on demographic data.
Typical applications include estimation of market potential
for individual products or market areas, media selection
for promotional activities, selection of areas for
merchandizing effects and retail outlet site selection.

3) Social and Labor Force Studies. As a major social and
employment force, the Bell System has a requirement to track
and predict changes in the society it serves and the work
force it employs, in order to assess the impact of not only
its own actions but various legislative and judicial mandates
that may come into force. Typical problems faced in this
area include the changing nature of the family/household
unit, ethnic balance of the employee group, the entry of
women into the labor market, and the availability and
movement of skilled craft workers.

It can be seen then that Bell's need for census data is significant,
quite varied, and subject to relatively rapid change over time.

There are three broad areas of concern which transcend, to some extent,
the categories specified for this conference:



DATA ACCESS

As in the case of many other business users, Bell has relied very heavily
on intermediate suppliers for the actual data used and has satisfied a
minority of its needs by direct access to the Bureau and the original
data. The comments and suggestions of these suppliers have been
incorporated in this paper where appropriate. Although this was, to some
extent, a planned condition for the 1970 Census and our experience has
been good, there is an open question as to whether this is the best way
to operate in the long run. As our needs and data volumes increase,
in-house processing may become attractive. Should we obtain such data
directly or indirectly? Could the Bureau organize to meet demands which,
in all probability, would be sporadic and subject to heavy peak loads?
There do not appear to be any facile answers, but the problem should be
addressed.

TIMELINESS

An endemic problem for us and most other users we are aware of is the
speed with which the data becomes physically available for use. A year
is the customary minimum from completion of a survey to availability.
Granted the volumes are huge in many cases, but data processing technology
today will surely permit a more timely response.



HOUSEHOLDS 



In terms of product and service consumption, the household is a very
complex unit. In the case of certain home related services or consumer
durables (e.g., basic telephone service, furniture) the household itself
may be construed to be the consumer. In the case of more personal
products (e.g., toll calls, clothing) the individual is normally thought
of as the consumer. In fact, the distribution of purchase and acquisition
decisions runs the gamut between these extremes, colored in many cases
by different value systems and personal perceptions. The present household
tabulations offered by the census do not adequately address this significant
diversity.



Specifically, the following items deserve attention.

1. Below the national level, 1970 Census household income
distributions were usually broken down into families and
unrelated individuals. A more useful division would be
households with related individuals and those with only
unrelated individuals. Since 1970 the proportion of
households in the latter category has been increasing and
indications are that that trend will continue thru 1980.
If the tabulation for unrelated individuals is retained,
it should at least be broken down into single-person
households and persons in (non-institutional) group
quarters. Furthermore, this information is of broad
enough interest to warrant making it readily accessible
in published form.

2. 1970 Census households were typed according to their "heads".
This designation will be changed in 1980 to "the person (or
one of the persons) in whose name the home is owned or rented".
This suggests three classifications for each of the two
household categories above: (1) joint owners/renters; (2) male owner/
renter; and (3) female owner/renter.

3. The tabulations in the 1970 Summary Count did not include
breakdowns by the number of wage-earners in a household.
Particularly in the case of families, this information is an
important determinant of socioeconomic needs and consumption
patterns. With female participation in the labor force currently
on the increase, it is important to measure the contribution
made by working women to a family's (household's) income. It
will probably be preferable to base the breakdown on full-time
workers rather than all wage-earners; i.e., do not include
part-time workers.

4. More research is also needed into the best way(s) to aggregate
households and persons in terms of the relationships between
the economic decisions they make and their socioeconomic
characteristics. For instance, which decisions in households
with multiple wage earners are generally made collectively and
which are left to individuals?

Over and above these three general items, other areas of concern include:

ORGANIZATION

1. Summary Tapes

After the 1970 Census an additional Fifth Count Summary Tape
for block groups and enumeration districts (known as File C)
was processed at the expense of one of the suppliers. This
tape has been used extensively by organizations which reallocate
demographic data from census areas to user-defined areas. The
1980 Census, including sample questions, should be designed under
the assumption that a similar tape will be made available as
a standard product.



2. Public Use Sample Tapes

a. 1970 PUS tapes had nonstandard labels (leading numeric
characters rather than alphabetics). Unless an important
reason for this exists, the ease of tape usage would be
improved by putting standard labels on the 1980 PUS tapes.

b. Certain of the 1970 tapes contained information for multiple
states, presumably for reasons of storage efficiency. Users
needing data on the last state of that tape had to read thru
the records for all preceding states. If the multi-state
tapes were organized into separate files for each state, the
processing time could be greatly reduced.

c. When cross-tabulations of particular census data items did
not appear in the 1970 Summary Tapes, programs were written
to compile the necessary data from the 1970 PUS tapes.
Unfortunately, for reasons of confidentiality the smallest
geographic units for which data on the latter tapes is
specifically identified are individual counties of 250,000
or more within SMSA's. The 1980 PUS could be broken down to
a lower geographic level, e.g., census tracts or rural
counties, with a corresponding decrease in the number of data
categories, e.g., income in $1000 rather than $100 intervals.
If disclosure problems still existed, the Census Bureau could
write a general-purpose program to produce the cross-tabulations
and check the output for confidentiality problems. The
usefulness of this program would be maximized if it were
accessible interactively through the Summary Tape Processing
Centers or their equivalent.
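
The idea in item c, coarser categories combined with a check of the output, might look roughly like the sketch below. It is an assumption-laden illustration, not a description of any Census Bureau procedure: the minimum cell size of 3, the rule of blanking small cells, the field names and the records are all hypothetical.

```python
from collections import Counter

# Hypothetical disclosure rule, assumed for this sketch only:
# a cell with fewer than MIN_CELL households is withheld.
MIN_CELL = 3


def income_band(income, width=1000):
    """Return the (low, high) bounds of the interval containing income."""
    low = (income // width) * width
    return (low, low + width - 1)


def checked_crosstab(records, min_cell=MIN_CELL):
    """Cross-tabulate tract by income band; small cells come back as None."""
    counts = Counter((r["tract"], income_band(r["income"])) for r in records)
    return {cell: (n if n >= min_cell else None) for cell, n in counts.items()}


# Made-up sample records, for illustration only.
sample_records = [
    {"tract": "0101", "income": 4200},
    {"tract": "0101", "income": 4750},
    {"tract": "0101", "income": 4999},
    {"tract": "0101", "income": 12050},
]
table = checked_crosstab(sample_records)
```

Widening the interval from $100 to $1000 pools more households into each cell, so fewer cells fall below the threshold; that is the trade between geographic detail and category detail which the item describes.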

TABULATION

1. Racial Classification

It is unnecessary to belabor the point, but the problem of racial
classification remains. We are aware that the Bureau is working
to ameliorate this difficulty and it is hoped that they succeed.
Accurate racial information is essential if work force targets
and other population influenced goals are to be determined on a
rational basis.

2. Public Use Sample

For many applications, the Public Use Sample is too small and,
in many cases, it is necessary to combine several political and/
or economic areas to obtain usable statistics. These then must
be imputed to the smaller areas within them, which is a statistically
questionable technique. A larger, more detailed sample together
with the format suggestions listed under "ORGANIZATION" would
produce a much more usable and credible product.



3. Households with Telephones

The need for a survey of households with telephones has been
documented ("Should 1980 Census Data Include Information on
Telephones?", Phil Welch, May 20, 1977) and acted upon with an
appropriately worded question in the recent Oakland pretest
questionnaire. This data will be most valuable to the user
community if it is cross-tabulated by other selected
characteristics. In particular, households with and without telephones
should be cross-tabulated with the demographic characteristics
of the owner/renter of the housing unit such as his/her age,
race and sex. These households should also be cross-tabulated
with total household income, presence and age of children, and
the classifications mentioned under "Households", above, i.e., families
vs. unrelated individuals, male vs. female vs. joint owner/
renters, and number of wage earners. These cross-tabulations
should not only fulfill the needs of the telephone industry
and related governmental agencies, but also allow the many
public and private organizations which perform surveys by
telephone to more precisely estimate the bias in the results
they compile.
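
How such a cross-tabulation would feed a bias estimate can be shown with simple arithmetic. The sketch below assumes the quantity of interest is a mean (for example, mean household income) and that the telephone survey reaches only households with telephones; the function name and the figures are hypothetical and are not census results.

```python
def telephone_coverage_bias(share_with_phone, mean_with_phone, mean_without_phone):
    """Bias of a telephone-only mean relative to the all-household mean.

    The all-household mean is the weighted average of the two groups, so
    the bias equals (1 - share_with_phone) times the difference between
    the two group means.
    """
    overall = (share_with_phone * mean_with_phone
               + (1 - share_with_phone) * mean_without_phone)
    return mean_with_phone - overall


# Made-up figures for one area: 90 percent of households have telephones.
example_bias = telephone_coverage_bias(0.90, 12000.0, 7000.0)
```

Both inputs that a survey organization cannot observe on its own, the share of households without telephones and their characteristics, are exactly what the proposed cross-tabulations would supply.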



PRESENTATION

1. Auxiliary information. As census data users we are interested
in examining demographic statistics for areas defined by our
organizations rather than the census areas. The most practical
and time-efficient way to establish the necessary correspondence
between these areas is through the use of geographic or geodetic
information provided by the Census Bureau for the census areas.
At least two such compilations were provided after 1970:

a. The Master Enumeration District List (MEDList) contains
the geodetic coordinates of the population centroids of
blockgroups and enumeration districts. The Census Bureau
is not sure whether they will provide this information for
the 1980 Census. Because of its importance and the urgency
of its release, the Census Bureau should consider making
arrangements to have this work done quickly and accurately
by an outside organization.

b. The maps of census tracts and enumeration districts are
essential companions to the MEDList - they are used to
verify the geographic translation of user areas into component
census areas. While the 1970 census tract maps were made
available on a timely basis, the maps for the nontracted
areas have been very difficult to obtain. Both sets of maps
should be released shortly after (if not slightly before)
the Census Day in 1980.

c. The Urban Atlas contains geodetic definitions of census
tracts. The preponderance of errors in this source indicates
that the validation portion of its creation procedure was
inadequate. Either this procedure needs to be improved or
the Census Bureau could again consider contracting for this
work with an outside organization.

2. Alternative medium - The very nature of magnetic tapes leads
to inefficiencies in terms of serial or sequential processing
rather than random access. The Census Bureau should seriously
consider supplying the 1980 data on another medium, e.g.,
floppy disk, that could be processed more efficiently.
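
The correspondence between census areas and user-defined areas described under item 1 can be sketched as follows. The sketch assumes that each blockgroup or enumeration district is assigned whole to the user area containing its population centroid, and that user areas are supplied as simple polygons; the ray-casting test is a standard technique chosen here for illustration, not a method named in this paper, and all names and coordinates are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order. Points that fall
    exactly on an edge are not treated specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def reaggregate(centroids, user_areas):
    """Sum BG/ED populations into the user area containing each centroid."""
    totals = {name: 0 for name in user_areas}
    for x, y, population in centroids:
        for name, polygon in user_areas.items():
            if point_in_polygon(x, y, polygon):
                totals[name] += population
                break  # each BG/ED is counted in at most one user area
    return totals


# Made-up coordinates and populations, for illustration only.
example_areas = {
    "east": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "west": [(-10, 0), (0, 0), (0, 10), (-10, 10)],
}
example_centroids = [(2, 3, 500), (7, 8, 300), (-4, 5, 200), (20, 20, 999)]
example_totals = reaggregate(example_centroids, example_areas)
```

The result is only as good as the centroid coordinates and area boundaries fed into it, which is why the accuracy of the MEDList and of the maps used to verify the translation matters to the user.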

SUMMARY

To summarize this statement of our wants, needs and concerns, we would
like to offer a brief description of the "ideal" census information
system from the business user's viewpoint:

1. Statistics on all census questionnaire responses from short and
long forms available to the blockgroup/enumeration district
(BG/ED) level;

2. Cross-tabulations among selected statistics which are defined by
the user;

3. Sufficient geographic information, e.g., geodetic references for
BG/ED, to allow reaggregation of census data to user defined
areas;

4. Detailed migration information, e.g., cross-reference by county,
to aid estimation of intercensal migration; and

5. Information ready for users less than one year after its collection.



ORGANIZATION OF DATA: CONSIDERATIONS RELEVANT TO THE
DEVELOPMENT OF USER ORIENTED SOFTWARE THAT MIGHT
ENHANCE THE UTILITY OF DATA GENERATED BY THE
U.S. BUREAU OF CENSUS 1/

by Mervin E. Muller 2/

World Bank, Washington, D.C.

1/ An invited paper to lead the discussion on Organization of Data
at the joint American Statistical Association and U.S. Bureau of
Census Conference on User Oriented Software, November 8-10, 1977,
Arlington, Virginia.



2/ Comments made here do not represent official views of the World Bank. 






CONTENTS

Summary

1. Introduction

2. For What Purpose?

3. Who are the users, what are their needs, what are their priorities?

4. What Time Horizon?

   4.1 For Planning and Development

   4.2 Span of Data

   4.3 Data by Variable vs. Data by Time Series

5. Modes and Frequency of Use

6. Recognition of Inertia

7. A Necessary Prerequisite: Data Identification

8. Current Data Base Management Systems: "There is still no free lunch"

9. Data Organization and Avoidance of Fallacies

10. Data Organization and Non-numeric Information

11. Use of Models to Analyze Data Organization

12. Procedural vs. Problem Approaches

13. Distributed Systems and Distributed Users

14. Challenges for Statisticians and Computer Scientists

15. Questions and Types of Software

   15.1 Questions to be Selected to Answer

   15.2 Types of Software

16. Basic Questions, Priorities, and Research Directions

17. Reasons to be Optimistic






SUMMARY

Several questions are raised in order to identify the complexities
and challenges that are involved in trying to understand better what
the problem of data organization is. These questions should help
the discussion to take place during these meetings by indicating areas
of research and development. Some of the questions have been raised in
order to ensure that they will be addressed. These questions are not
necessarily new but are ones that must be faced by those currently
involved with statistical analyses using computers, even though
satisfactory solutions may not be forthcoming at this time.

1. Introduction

Under the terms of reference of this conference, this paper has
been prepared to stimulate thinking prior to the conference and during
the conference in order that we can focus more effectively on what
types of software ought to be developed to aid in the area of data
organization. This problem must be viewed in a rather general context
in order to justify the attention given to it at this conference.
It is much larger than one might first believe. It is tempting to
assume that all we need to do is select from among the existing data
base management systems and our problem will, in fact, be solved.

I hope this paper will generate light, rather than heat. Having
stated this hope, I want to question whether we have an adequate
understanding of what we are trying to accomplish, even though the
objectives sent to us prior to this meeting were clearly presented. I
expect to raise several questions that are provocative and, hopefully,
useful, stimulating the kind of thinking the subject needs. I had considered






and discarded several alternatives for this paper, such as:
1) summarizing the history of the subject, 2) advocating a particular
approach or system, 3) evaluating existing systems, or 4) emphasizing
existing limitations. I hope through considering questions we can
develop proper respect for the problem and the importance of establishing
priorities for a meaningful and effective research and development effort
in this area.

2. For What Purpose?

The indicated purpose of the conference is for "the development
and perfection of software which will enhance utility of data generated
by the Bureau". The conference will also "examine the need for software
improvements from the user's standpoint and help determine the extent
to which the development of software is an appropriate topic for research
support by the NSF/ASA." Although these statements are clear enough, I
believe that we need to make them more specific in order to provide a
focus for what should be considered. I think it is important for the
conference attendees to discuss and refine the purpose of the conference;
I hope the questions raised in this paper will help clarify the point,
"for what purpose?", as well as help to focus attention on subsequent
actions to be taken based on the conference.

3. Who are the users, what are their needs and what are their priorities?

The term "user" can mean different things to different people.
Users could be those directly within the Bureau, or those within other
parts of the Department of Commerce, other parts of government, or those
external to government. It is important to know who the users are and




what their backgrounds are expected to be; are they to be professional
statisticians, experts in computing, or subject matter specialists who
will have the appropriate supporting staff, equipment, and software to
assist them in the use of data? It is necessary to identify what their
needs are -- particularly, what their data needs are. Can they be sure
that they have useful data and data identification in the sense of the
following: how will they cope with missing data? How will they be able
to recognize questionable accuracy or quality? These questions will be
dealt with again in Section 9 on Data Organization. Different people
have different needs, and to develop appropriate software for data
organization(s), it is necessary to identify who are the users, and
what are their needs. Finally, what are the relative priorities of
different user needs? It would be irresponsible to ignore the matter
of priorities, since users clearly have finite resources. Even a
government agency must also face the reality that it has neither the time
nor the resources to meet all software or data needs of all users.
Therefore, when directing planning and development, attention must be
given to how one would go about identifying user needs and establishing
priorities for what is to be done.

4. What Time Horizon?

To have proper perspective for the discussion to follow, it is
necessary to look at at least two aspects of time: the time horizon
of planning and development, and the time span of the data themselves.
By focussing on these two aspects of time, I believe we can ask relevant
questions and see more clearly how to meet the objectives of this
conference. Consequently, both aspects of time are given attention
before proceeding to some of the other considerations. For completeness,
a third aspect of time is also mentioned.



4.1 For Planning and Development

Whenever we look ahead, there are at least two pitfalls: first,
confining ourselves to the use of current technology and knowledge we
possess about how to use such technology to solve today's problems; and
second, restricting our thinking about the problems themselves due to
conservatism or recognition of the limitations of current technology.
When looking at the question of the development of user oriented software,
it is not at all clear whether we are talking about what can be done this
year; or three years hence at the time of the 1980 census; or at the time
of the next decennial census in 1990; or 20 years ahead in the year 2000.
The symbolic year 1984, indeed, falls in the early part of this broader
planning period.

In looking forward we might also look back a similar time period to
assess progress made. Twenty years ago Fisher was still with us;
computing was in its infancy. How far have we come since then? The
breadth of application of statistical techniques has been greatly
influenced by the availability of statistical software on digital
computers. With few exceptions, notably in graphics, and some changes in
emphasis, notably towards iterative methods, the world is much as Fisher
knew it. We are still, in the main, equipped analytically to handle
numerical data in rectangular form (univariate or multivariate):
variables by observations by time.

Although we are now able to store and retrieve non-numeric data, or
data in non-rectangular interrelated structures, we lack analytical tools
to support analysis directly using more complex data structures.

It is important to be realistic as to what time horizon we are
addressing as we proceed in the subsequent discussion before we can




really be sure what types of planning and development would be appropriate
for consideration. For example, is it meaningful to consider that
significant technological or theoretical break-throughs may occur in time
to be of benefit? Are we looking ahead to the possibility of a data network
where the hardware and/or data can be considered distributed, geographically
and logically? Clearly, if this is a possibility, then more attention
must be given to improved ease of access to the data in the presence of
controls which recognize privacy, confidentiality and security, and this
affects the selection of data organizations. According to the time horizon,
I can easily imagine that we will develop different plans and approaches.

4.2 Span of Data

In looking at questions of data organization, there are two questions
regarding the time span of the data: 1) are the data (actual or predicted)
to be organized and maintained only for current time periods or current
time periods plus historical periods? 2) are the data for each time period
to be maintained separately? The influence of these considerations on data
organization also depends upon the extent of data and the frequency of use.
The possibility of data migration from one hardware device to another is
also affected by whether the data must be currently available or available
only for historical archival purposes. We will address this point in a
later section.

4.3 Data by Variable vs. Data by Time Periods

If we think of data organized as time series, this type of organization
is not the one naturally employed when collecting social or economic data,
but it may be the desirable type of data organization for analysis or
reporting purposes. Usually we obtain social or economic data for a given
time point or period for many variables. This is the natural way to collect




census data. However, for a given variable an analysis, even for data
consistency, may make it necessary to use data by variable across time
periods. The time aspects of data for a given variable raise many
interesting challenges and questions with respect to data organization.
When data are stored on a direct access device there can be an erroneous
impression that it is immaterial how the data are organized and stored.
That is, to assemble a time series of the values $\{X_i(t),\ t = 1, 2, \ldots, T\}$
for a given variable $X_i$, when the data are stored by time period and
variable, some people may assume that it is convenient and efficient to
retrieve the desired data values by searching for each time value of
each variable. This assumption may be correct if for $n$ data points the
search effort can be done in less than $K n \log n$ operations. However,
reorganizing the data to be a collection of time series by first sorting
the data and then using it sequentially may be a more efficient and
effective approach.
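
The contrast between repeated searching and a single reorganization
can be sketched as follows. This is an illustration only, not a method
given in this paper: Python is used for convenience, and the record
layout (period, variable, value), the function names, and the sample
figures in the usage note are all assumptions made for the example.

```python
from collections import defaultdict


def series_by_search(records, variable, periods):
    """Assemble one time series by scanning every record for each period.

    `records` is a list of (period, variable, value) tuples stored in
    arbitrary order (a hypothetical layout). The cost is on the order of
    len(records) * len(periods) comparisons per variable.
    """
    series = []
    for t in periods:
        for period, var, value in records:
            if period == t and var == variable:
                series.append(value)
                break
    return series


def reorganize_by_variable(records):
    """Sort once by (variable, period), then read the data sequentially.

    The sort costs on the order of n log n; afterwards every variable's
    time series is available without any further searching.
    """
    series = defaultdict(list)
    for period, var, value in sorted(records, key=lambda r: (r[1], r[0])):
        series[var].append(value)
    return dict(series)
```

For a hypothetical file holding "pop" and "income" for periods 1 to 3,
`series_by_search(records, "pop", [1, 2, 3])` and
`reorganize_by_variable(records)["pop"]` return the same series; the
second form pays the sorting cost once and then serves every variable.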

Even with such brief considerations of this section, I think you
will agree that it is important for data organization to take into
account the many time aspects of data.






5. Modes and Frequency of Use

It is necessary to consider the modes of data use and the frequency
of data use. Frequency of data use will have important ramifications for
data organization, which are considered in more detail in Section 9.
I find it useful to distinguish four categories of computer use, namely,
production mode, diagnostic test mode, tutorial mode, and exploratory mode.
As noted in Muller (1969), one reason for considering these four modes is
to facilitate separating the problems of using computers into understandable






and manageable parts, which may also help clarify issues and close the
current gaps between hopes and achievements in use of computers.
Another reason is to obtain better understanding of where to allocate
research and development effort in programming and statistical techniques.
Some of us still suffer from the expectation that a given "general" program
can be all things to all people. Of the four modes of use, the one that
most people think of is the production mode, i.e. the one the user employs
to accomplish a specific computing job which no longer requires testing
programs. It is assumed one knows what he wants done and how to do it
(even though the user may also need help of the diagnostic test mode).



The diagnostic mode is used to aid in testing whether or not a
program or package can in fact be used for production purposes.

In a tutorial mode one may want help from a specialized computer
program to learn, for example, 1) how to use a program, 2) how to
understand and use available data, 3) how to use the available computer
facilities, or 4) what programs or data are available. The tutorial mode
is intended to support the learning of a particular body of knowledge.
In the context of the current conference, the tutorial mode might enable
users of Census Bureau data to explore various data bases and software
that can be used, including descriptions of data structures that are
available, and data coding conventions and the like which are relevant
to using the data.

An alternative to the tutorial mode is to maintain and distribute
comparable information by more conventional means. The questions to be
answered here are those of costs and benefits of each approach.

The fourth mode, exploratory mode, allows the user to explore
existing programs, computer languages, and operating systems so they






can understand what they are doing. For example, what levels of precision
of calculations are available? Is truncated or rounded arithmetic used
in the programs?
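
As one hedged illustration of what an exploratory session might look
like, the sketch below asks a Python interpreter about its own
floating-point arithmetic and contrasts truncation with rounding. It is
an example of the kind of question raised above, not a facility
described in this paper; the function names and the two-decimal-place
default are assumptions.

```python
import sys
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN


def describe_float_arithmetic():
    """Report the precision and rounding behaviour of native floats."""
    info = sys.float_info
    return {
        "mantissa_bits": info.mant_dig,      # bits of precision carried
        "decimal_digits": info.dig,          # decimal digits reliably held
        "machine_epsilon": info.epsilon,     # gap between 1.0 and next float
        "rounds_to_nearest": info.rounds == 1,
    }


def truncate(text, places="0.01"):
    """Drop the digits beyond `places` without rounding."""
    return Decimal(text).quantize(Decimal(places), rounding=ROUND_DOWN)


def round_half_even(text, places="0.01"):
    """Round to `places`, sending exact ties to the even neighbour."""
    return Decimal(text).quantize(Decimal(places), rounding=ROUND_HALF_EVEN)
```

Applied to the value "2.679", the first helper keeps 2.67 and the second
gives 2.68, which is the distinction a user would want to discover
before trusting totals produced by an unfamiliar program.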

6. Recognition of Inertia

In spite of spectacular technological achievements in hardware, it
is important to recognize that development of computing techniques for
improving the quality and usefulness of data suffers from inertia, in
particular progress in the software that would be required to bring about
changes commensurate with the spectacular improvements in hardware.
If one now reviews the proceedings of the 1969 conference on statistical
computing held in Wisconsin, it will be noted that most of the open
research and development problems identified then are still with us.
(See Milton and Nelder (1969).) There are few significant break-throughs
in statistical techniques for data editing, data analyses for presentation,
or data organization; the work of Fellegi and Holt on data editing, or the
work on intervention analysis by Box and Tiao or on data organization by
Merten are exceptional cases. Thus the lead times between identifying
problems and finding practical solutions may be very long. One must
recognize how difficult it can be to overcome inertia without a high
priority emphasis and critical investment of people's time. Although we
have on-line and interactive computing capabilities, we are far from the
situation of being able to perform on-line, interactive statistical analysis.

This conference and the subsequent commitment of considerable resources
may provide the critical mass needed to overcome the current inertia, if
there is adequate follow-up. This inertia is reinforced by the present
concern over privacy and fears of invasion of privacy, as well as by broader
issues of confidentiality, including unintentional disclosure.




Another type of inertia is the failure to recognize how little
progress has been made on standards for data identification and control.
Until there is such progress, the obstacles to portability of software
and data (see e.g. Muller (1975)) will inhibit, slow down, or preclude
effective general use of available data.

7. A Necessary Pre-requisite: Data Identification

For those who were practicing statisticians before the wide use of
computers, data code books were a familiar part of a well-designed data
collection and analysis process. "Code" is used here to include any type
of data identification. A few computer-based systems have computer-readable
code books; some people refer to them as "data dictionaries" or, as I prefer,
"data glossaries" (to indicate a capability richer than just a code book or
dictionary, see Muller (1963)). I seriously question how data can be easily
portable without a clear indication that codes can have different meanings
at different times, or that at a given time multiple codes may have the
same meaning. It is unrealistic to expect that this problem can be overcome
by universal standards. Instead, I would urge that a necessary pre-requisite
to improving the use of data is to create data-base directories which will
enable the user to recognize and cope with different interpretations of
data identification. Such data directories often must include the
identification of the quality, source, and timeliness of the data. They
may also include the identification of the various data structures used.
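
A minimal sketch of such a data glossary follows, written in Python
purely for illustration. Nothing here reproduces an actual Census
Bureau code book: the class and field names, and the sample variable
and codes in the usage note, are hypothetical. The sketch only shows
the two situations named above, a code whose meaning changes over time
and several codes that share one meaning at a given time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeEntry:
    variable: str
    code: str
    meaning: str
    valid_from: int            # first year in which the code has this meaning
    valid_to: Optional[int]    # last such year, or None if still in force
    source: str = "unknown"    # where the data came from
    quality: str = "unknown"   # any note on accuracy or reliability

    def covers(self, year):
        """True if this entry applies in the given year."""
        return self.valid_from <= year and (
            self.valid_to is None or year <= self.valid_to)


class DataGlossary:
    def __init__(self, entries):
        self._entries = list(entries)

    def meaning(self, variable, code, year):
        """Return what `code` meant for `variable` in `year`, if known."""
        for entry in self._entries:
            if (entry.variable == variable and entry.code == code
                    and entry.covers(year)):
                return entry.meaning
        return None

    def codes_for(self, variable, meaning, year):
        """Return every code carrying `meaning` for `variable` in `year`."""
        return sorted(entry.code for entry in self._entries
                      if entry.variable == variable
                      and entry.meaning == meaning
                      and entry.covers(year))
```

With invented entries in which code "1" of a variable "tenure" means
"owner occupied" through 1969 and "renter occupied" from 1970, the call
`meaning("tenure", "1", 1965)` and the call `meaning("tenure", "1", 1975)`
give different answers, which is exactly the information a portable
data set must carry with it.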

8. Current Data Base Management Systems: "There is still no free lunch"

There are many aspects to the current literature on data base management.
There is the schema of total data base management where one looks for a way
of describing the logical properties of the enterprise or agency, the use
of data, and the logical organization of the data to be used. There are




some impressive capabilities, such as data definition languages.
Unfortunately, many of the important stochastic considerations that
influence how to design and effectively use such data bases either are
not handled in existing data management systems or are ignored. The
data base systems are usually designed as if to be used in a totally
deterministic manner.

We seldom get anything free. Data base systems require an investment
of resources to acquire or build the system as well as the cost of
maintaining it, converting to it, and training people in how to use it.
In some respects those advocating or using data base management systems
are justifying them on the ground of increase in programmer productivity,
with arguments similar to those employed to justify higher level
programming languages as replacements to machine code or assemblers.
There is clearly a need to increase programming productivity. In this
sense, some data base management systems can provide programming tools to
facilitate the input, output, and transfer of data across physical
storage devices.

Associated with these tools is the expectation that there will be
greater data and program independence as a result of having "the
appropriate data base management system". Another expectation is that the
system is extensible to changing user data needs. Although some of these
systems have been around for a long time, I have not seen case histories
documenting how such systems have contributed to improved statistical
analyses or better portability of data. Unless one is clear about the time
horizon and the needed research and development for organization of data,
great opportunities for the distribution of data bases by data networks
will be missed or delayed because data base management capabilities
(techniques and software) are not adequate to take advantage of the
hardware and telecommunications enhancements.

To face these emerging problems by means of newly-designed "data base
management systems" which do not yet exist will take time and could be
costly. As statisticians, we should be interested in the collection and
analyses of data to evaluate how to design, use, or modify such systems
of data base management, recognizing that pre-packaged systems are not
likely to solve all of the important problems.

9. Data Organization and Avoidance of Fallacies

The literature is full of papers on how "best" to organize data,
as if there were some set of criteria of optimum data organization.
By itself, such a factor as frequency of use is an inadequate criterion
for deciding how to organize the data. Even with additional information
there may be no "optimum" data organization, see Merten and Muller (1972).
For exclusively batch processing, one might want a data organization that
would minimize the average access time, whereas in an interactive use of
data one might need a form of data organization which would ensure stability
of response time, for example, a minimum variance in the service access time
to obtain the data. Unfortunately, there is no single optimum data
organization.
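
The point that the two criteria can disagree may be seen in a small
sketch. The code below is illustrative only and is written in Python
for convenience; the organization names and the access times quoted in
the usage note are invented, and the rule "batch uses the mean,
interactive uses the variance" is a simplification of the argument
above rather than a recommendation.

```python
from statistics import mean, pvariance


def summarize(access_times):
    """Return the average and the variance of observed access times."""
    return {"mean": mean(access_times), "variance": pvariance(access_times)}


def preferred(organizations, mode):
    """Pick the organization that is best under the criterion for `mode`.

    `organizations` maps a name to a list of observed access times.
    For mode "batch" the criterion is the smallest mean; for any other
    mode (taken here to mean interactive use) it is the smallest variance.
    """
    criterion = "mean" if mode == "batch" else "variance"
    return min(organizations,
               key=lambda name: summarize(organizations[name])[criterion])
```

Suppose one organization answers four requests in 10 units and a fifth
in 100 (mean 28), while another always takes 30 units. The first wins
for batch work and the second for interactive work, so neither can be
called the optimum.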

Another fallacy is that there should be only a single data organization
for a given set of data. This is one of the limitations associated with
current data base systems. As a minimum, one may want one type of data
organization for the effective and efficient maintenance of the data,
but multiple forms of data organization for different types of use to
be made of the data, for example, a data organization by time period and
a data organization by variable to aid the construction of time series.
The question of what should be "the" data organization is too general a
formulation to be of much concrete value. In many respects, organizing
data for effective use resembles designing a queueing system with the




arrivals and possibly the service being stochastic processes. In addition
to this point, will the time horizon for the research and development
effort cover a sufficient span to consider distributed hardware and data
bases? Is it necessary to maintain historical data? Does the frequency
or volume of use warrant techniques to allow for the migration of data to
various physical devices? As a minimum, the data may be organized in such
a way as to be portable by having the identification of data and coding
structures, and the data codes, external to the data content. Current data
base management systems sometimes inhibit portability of data, or make it
necessary for a potential user of the data to make a large investment to
acquire the entire data base system in order to use a given set of data.
Furthermore, for some applications, control must be provided against
unwarranted access. Such systems could be unacceptable because of the
need to reprocess or even reorganize the data so that they can become
portable to multiple users with different access privileges.
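
The queueing analogy made above can be explored with a very small
simulation. The sketch below is an assumption-laden illustration in
Python, not a model proposed in this paper: it supposes a single
server, exponentially distributed times between requests and
exponentially distributed service times, and the function and
parameter names are invented.

```python
import random
from statistics import mean, pvariance


def simulate_requests(n, arrival_rate, service_rate, seed=0):
    """Simulate n data requests served one at a time, first come first served.

    Returns the response time (waiting plus service) of each request.
    Both the arrivals and the service are random, which is the sense in
    which data access resembles a queueing system.
    """
    rng = random.Random(seed)
    clock = 0.0            # arrival time of the current request
    server_free_at = 0.0   # time at which the server finishes earlier work
    response_times = []
    for _ in range(n):
        clock += rng.expovariate(arrival_rate)
        start = max(clock, server_free_at)
        server_free_at = start + rng.expovariate(service_rate)
        response_times.append(server_free_at - clock)
    return response_times


def summarize_responses(response_times):
    """Average and variance of response time, the two measures at issue."""
    return {"mean": mean(response_times),
            "variance": pvariance(response_times)}
```

Running the simulation under different assumed rates gives a feel for
how quickly both the average and the variability of response time grow
as the request rate approaches the service rate.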

As in the case of hardware, it is reasonable to look forward to
large economies of scale through having data bases maintained and
distributed from central data services. If so, additional research by
statisticians will be needed to determine what kind of data, where the
data should be located, and how it should be organized. Here, again, we
will need criteria to indicate who the users are, for what purposes they
need the data, what are their modes and frequency of use. We also need
to keep current on the relative costs of transmission and processing of
data. I hope I have not disappointed anybody by recommending a relatively
modest approach to these problems; I do not believe that the field has had
enough research or is matured enough to cope satisfactorily with the
complexity of the present situation.



Considering data organization from a purely deterministic point
of view, much of the current literature which follows the results of
the CODASYL Committee is relevant. This point of view treats data
organization in terms of logical and physical descriptions to aid
computer programmers, and several important issues of languages for data
description and data structure are addressed. For example, data systems
are described as network models, hierarchical models, or relational
models, to mention a few. If one looks closely at these efforts,
however, no criteria are being put forward in terms of how many levels
of a hierarchy one should have or, in the relational model, how one
describes the data internally to achieve efficient use of the data.
Much of this effort is aimed at allowing data independence so that
programs and data can be changed without affecting the end-users.


Although such formal descriptions of data bases can be of great
help, they neglect the questions of effectiveness and efficiency, and
I believe these issues are stochastic in nature. Also neglected is the
matter of indicating or organizing data according to source, quality,
or timeliness. We statisticians recognize that there is a wide set
of problems where stratification can improve sampling efficiency.
Similar advantages can be gained through using stratification techniques
with regard to the organization and distribution of data bases. With
stratification it may be advantageous to establish one or more data or
access directories at various levels of a hierarchy or network.
Stratification can also help to eliminate conflicts on data access within
the computer, as well as to improve service performance with respect to
average service time or variance of service time, to mention two
performance characteristics which one might want to consider. It is not
at all clear what performance criteria one should use. One can formulate
the performance problem as a mathematical programming model and look at
the question of optimization relative to some objective function, but to
date I have not found this a very useful description other than to
demonstrate the existence of a solution, see for example Merten (1970).
In view of the sensitivity of "optima" to assumptions about data which
themselves are subject to unknown changes, [illegible] is required here.
Perhaps the views of those in attendance can help clarify the priority
to be given to optimization criteria.
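
One way to picture a stratified access directory is sketched below. It
is an illustration under stated assumptions, written in Python: the
choice of stratum, the record fields named in the usage note, and the
function names are hypothetical, and the sketch says nothing about
which stratification would actually be efficient for census files.

```python
from collections import defaultdict


def build_directory(records, stratum_of):
    """Build an access directory: stratum -> positions of its records.

    `stratum_of` is any function that assigns a record to a stratum,
    for example by geographic area or by data source.
    """
    directory = defaultdict(list)
    for position, record in enumerate(records):
        directory[stratum_of(record)].append(position)
    return dict(directory)


def lookup(records, directory, stratum, predicate):
    """Examine only the records of one stratum, not the whole file."""
    return [records[i] for i in directory.get(stratum, [])
            if predicate(records[i])]
```

If tract records are stratified by a field such as "state", a request
about one state touches only that state's entries, and requests for
different states need not contend for the same part of the file.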

Data organization includes the question of security and control:
what types of user access will be allowed and for what purpose.
Furthermore, some parts of a record may be considered sensitive and
therefore should have some type of encrypting or scrambling to protect
the sensitive parts, another case where multiple files using different
forms of organization may be appropriate.

10. Data Organization and Non-numeric Information

In the future some types of data organization should exist to handle
non-numeric information, which I believe is necessary to consider, especially
if the time horizon of the research and development effort exceeds a few
years. Ordinarily, one tends to consider non-numeric information to be
synonymous with text. Even this kind of data offers unexploited
opportunities for data analysis. Although some types of data organization
already include the facility to handle text such as footnotes, report
titles, table headings, stubs, and user instructions, I believe that we
need to consider more complicated data organizations and storage facilities,
capable of handling digital representations of graphs, maps, and pictures.
With satellite capabilities to "collect pictures" and "create maps", and
with the emergence of satellite or fiber optics communications for digital



transmission, statisticians now need to plan how they can improve analysis
and presentation of such results. Additional statistical techniques may
also be required to use this technology. The challenge of non-numeric
information is here; are we prepared to accept it?

11. Use of Models to Analyze Data Organization

Analytical models to describe and evaluate the performance of data
organizations or the associated software can have a place in a research
and development effort if they provide useful reductions of the complexities
of the real world. I believe that the models should have a stochastic
orientation. Such models, to be useful, must reflect qualitative as well
as quantitative factors of relevance to the user, such as ease of learning
or ease of use. However, care must be taken not to lose sight of the end
objective of achieving effective data organization and software. Unless
one can collect real data to validate the reasonableness of a model, one
should, in my opinion, suspect the conclusions or usefulness of modeling
efforts.

12. Procedural vs. Problem Approaches

Most of the higher-level languages available today are effective if
one is prepared to describe a problem using data (or the organization of
data) in terms of procedures. The same could be said of most large-scale
statistical packages that are now available to analyze data. One of the
attractions of some data base systems is that they have commands which
are more problem-oriented than procedure-oriented. The advantage of such
a command structure depends on how important it is to adopt a problem
approach rather than the procedural approach to the use of the data.
The question is how much research and development effort is needed here.




The answer will depend on identifying the users, their needs, and the
time horizon. If the users are experts in programming and have been
trained in ways that exist today, then it would seem natural to use a
procedural approach. However, in looking ahead it is not at all clear
that this is what is desired if it is intended to stimulate the use
of census data outside of the Bureau.

With problem-oriented software one could describe the problem rather
than the procedures: for example, identify the file and the particular record
types or fields within the file that one would want, and then the criteria
for selection and analyses of data, rather than the detailed procedures.
On the basis of the problem specifications, special compilers or translators
would analyze the problem specification, either to generate procedural calls
for use by conventional compilers or to translate the specifications to
procedures interpretatively. I believe this is a fruitful area of research.
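The contrast between stating a problem and writing its procedure can be sketched in a few lines of Python. This is only an illustrative sketch, not a system described in this paper: the record layout, the `SPEC` format, and the `translate` function are all hypothetical names invented here. The specification names the fields wanted, the selection criteria, and the analysis; a small translator turns it into a procedure.

```python
# Hypothetical records standing in for a census-style file; the field
# names and values are invented for illustration only.
RECORDS = [
    {"state": "MD", "age": 31, "income": 18000},
    {"state": "MD", "age": 62, "income": 21000},
    {"state": "VA", "age": 40, "income": 16000},
    {"state": "MD", "age": 45, "income": 9000},
]

# Problem-oriented specification: WHAT is wanted, not HOW to obtain it.
SPEC = {
    "fields": ["age", "income"],
    "where": [("state", "==", "MD"), ("age", ">=", 25), ("income", ">", 15000)],
    "analysis": "count",
}

OPERATORS = {
    "==": lambda a, b: a == b,
    ">=": lambda a, b: a >= b,
    ">": lambda a, b: a > b,
    "<=": lambda a, b: a <= b,
    "<": lambda a, b: a < b,
}


def translate(spec):
    """Check a specification, then return a procedure that carries it out."""
    unknown = [op for _, op, _ in spec["where"] if op not in OPERATORS]
    if unknown:
        raise ValueError(f"unsupported operators: {unknown}")
    if spec["analysis"] not in ("count", "list"):
        raise ValueError(f"unsupported analysis: {spec['analysis']}")

    def procedure(records):
        # Select the records meeting every criterion, keeping only the
        # requested fields.
        selected = [
            {name: rec[name] for name in spec["fields"]}
            for rec in records
            if all(OPERATORS[op](rec[field], value)
                   for field, op, value in spec["where"])
        ]
        return len(selected) if spec["analysis"] == "count" else selected

    return procedure


run = translate(SPEC)
result = run(RECORDS)
print(result)
```

The checks at the top of `translate` are a very small instance of the semantic question raised in the text: whether a specification permits a correct and unambiguous execution at all.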

The problem approach has ambiguities, not so much in the syntax for
problem specification as in the semantics of determining whether or not
the specification of the problem permits a correct, unique and unambiguous
computer execution. Without trying to prejudge what the future direction
of the study should be, I think it should start with straightforward and
practical problems followed by cases of greater complexity. Some of my
colleagues and I have been looking at this challenge for some time, and
we believe it has relevance to situations involving the need to accomplish
multi-dimensional data array manipulations and transformations. In this
area we feel we have been relatively successful, but it is an area needing
additional research and development; see, for example, Muller (1977).



13. Distributed Systems and Distributed Users

It is difficult to accept the views held by some data base systems
advocates that current data base management systems provide a solution
to the data organization problem. Under the best of conditions, such
systems may be solving some of today's problems, but these are not
necessarily the problems that will be facing us for the next few years.
A sizable investment is required to select and install current data base
systems. Such investment may divert resources from needed research and
development.

As data files become larger, it seems logical to expect that there
will be an increasing need for data organization that allows the data
to be distributed across hierarchical storage devices. It is logical to
expect that, depending upon the time horizon under consideration, the
data could be distributed geographically. Depending on who the users
are, and their objectives, it seems reasonable to investigate distributed
data bases as a logical and effective approach. The question then arises,
is it reasonable to assume that the users need distributed data? I believe
it is realistic to assume that the users will be distributed and want to
use distributed data bases. Attention must be given to access control,
security, and the need for tutorial modes of use to enable users to
understand and use data if they no longer go to a central facility to
acquire the data. This raises problems of maintenance both of the data
and of software. Consequently, the question I see here is, what criteria
should one consider as statisticians in making decisions about distribution
of hardware, software, and the users, and what ramifications will this have
on the usability of data?






As noted earlier, the question of distribution of users is related
to the question of economies of scale. Large general-purpose machines
have been popular because they offer economies of scale. Intelligent
terminals with local memory undoubtedly will generate additional uses
of centralized, large-scale general purpose machines. I believe we can
expect to realize economies of scale for data bases in data networks
with smart terminals without necessarily having all the data in one file.
One of the questions that needs clarification is how to achieve effective-
ness and still enjoy economies of scale.

14. Challenges for Statisticians and Computer Scientists

It is a real challenge to bring together computer scientists and
statisticians to identify who the users are, and what their needs are.
A second challenge is to recognize that the design and evaluation of
systems to cope with data gaps involves a problem of statistical analysis.
A third is the need for evaluations of the performance of different data
organizations, the software using the data from such organizations, and
the software for the statistical analysis using the given data organi-
zations. The evaluation, I believe, should be based on carefully designed
statistical experiments so that one can estimate the main effects and
interaction effects of the various parameters one might have under control.
I am using the term "interaction effects" in the sense employed by a
statistician who has designed, say, a factorial experiment. I believe
this is a very fruitful and necessary area to consider, one well worth
receiving an allocation of resources for future research, and I would
hope that attention will be given to this area.
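To make the terms "main effects" and "interaction effects" concrete, the following Python sketch estimates them for a two-factor, two-level factorial experiment. It is a minimal illustration under stated assumptions: the two factors (a data organization and a software package), the level coding, and every timing value are invented here and do not come from any measurement.

```python
from statistics import mean

# Hypothetical run times (seconds) for a 2x2 factorial experiment.
# Factor A: data organization (-1 = sequential, +1 = indexed)
# Factor B: software package   (-1 = package X,  +1 = package Y)
# Every number below is invented for illustration.
RUNS = {
    (-1, -1): [40.0, 44.0],
    (+1, -1): [20.0, 24.0],
    (-1, +1): [30.0, 34.0],
    (+1, +1): [22.0, 26.0],
}


def factorial_effects(runs):
    """Estimate main effects A and B and the AB interaction effect."""
    cell_means = {levels: mean(values) for levels, values in runs.items()}

    def contrast(sign):
        # Signed sum of the four cell means, halved, so that an effect is
        # the change in mean response from the low to the high level.
        return sum(sign(a, b) * m for (a, b), m in cell_means.items()) / 2

    return {
        "A": contrast(lambda a, b: a),
        "B": contrast(lambda a, b: b),
        "AB": contrast(lambda a, b: a * b),
    }


effects = factorial_effects(RUNS)
print(effects)
```

With these invented values the A effect is larger under package X than under package Y, which is exactly what a non-zero AB term records: the benefit of one choice depends on the level of the other.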



15. Questions and Types of Software

15.1 Questions to be answered

The types of software research and development to be recommended by
this conference depend in part upon which questions we decide should be
pursued.

The questions can include:

• Time horizon
    - planning and development for: 1978, 1980, 1984, 1990, or ?
    - span of data: current only, historical only, future,
      or some combination
    - data by variable vs. data by time period
• Data for what purposes?
• Who are the users, what are their needs, what are their priorities?
• What modes of use are to be supported: production, diagnostic,
  tutorial, exploratory?
• How frequently are the data to be updated, distributed, used?
• What data identification will be needed, how will it be
  distributed, and how will it be maintained?
• Will data standards be formulated and maintained?
• Will portable data directories be established and required?
• Types of data base systems: centralized, distributed, decentralized.
• Where should the data be located?
• Who should control access to the data or the data directories?
• Will non-numeric information be part of some of the data bases?
• Will statistical techniques be used to gather data or perform
  analyses, to influence data organization?






• What software will be developed that is acceptable to
  different users for conversions of data from one type
  of data organization to others?
• Will there be an agency prepared to provide software to
  convert data?
• What levels of data security and confidentiality are required?
• What back-up facilities are required to ensure uninterrupted
  user services?
• Should problem-oriented software be developed to access and
  use the data bases?
• What extent of distributed systems and users are to be supported?
• What financial and human resources can be made available for
  various types of effort?

All of these questions have political as well as technical aspects,
especially those involving security, privacy, confidentiality, the use
of distributed data or networks, or use of the data by commercial service
bureaus.

15.2 Types of Software

The types of software to be developed depend in part upon how the
selected questions are answered. In addition, the types of software
to be developed should reflect the kinds of statistical analyses that
are expected to be needed and available. I am concerned that unless
explicit attention is focussed on statistical questions, software
development will be undertaken without an adequate underlying statistical
basis. Take, for example, analyses allowing for missing data or techniques
to classify multivariate data as being suspect or defective depending upon
how the data are to be presented or used.






Regardless of the types of software to be developed, there are
further questions needing attention, possibly by others not attending
this conference. These questions include who will

• develop the software
• test the software
• distribute the software
• maintain the software
• administer requests to change the software.

I believe software is needed to:

• collect and maintain data on the use of data bases (such data
  can be used for evaluation purposes to influence data organi-
  zation as well as clarify whether there is sufficient demand
  for use of the data)
• control access for the creation, modification, removal, or
  distribution of data, as well as determine when simultaneous
  use of the data can be permitted
• maintain portable data directories for those who have different
  data organizations or equipment
• restrict data to forms that are compatible with the user's
  environment
• handle centralized or decentralized data bases
• monitor use of data so as to notify users when, subsequent
  to their access to the data, errors are detected in the data,
  including audit trails where needed
• store, retrieve, and use non-numeric statistical image infor-
  mation such as graphs, maps, and pictures
• monitor use of data to estimate what data to have, where, and
  for whom



• allocate and reallocate dynamically the locations of the data,
  the amount of main memory to be used, and the access routines
  to be used. Such software may belong to the operating control
  system of the hardware network, but it should be designed in
  such a way as to be portable for users
• make possible uninterrupted service or error recovery with a minimum
  loss of information for any users accessing data bases by
  means of a data network
• provide problem-oriented software as well as procedure-oriented
  software.

I am assuming that software to enable use of distributed equipment will be
available as well as necessary software to create audit trails making
possible data recovery after environmental or equipment interruptions.
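Two of the needs listed above, monitoring use so that users can be notified of errors found after their access and collecting data on demand, can be sketched together. The Python below is only a hedged illustration: the `UsageMonitor` class, its method names, and the sample users and files are hypothetical and do not describe any existing Bureau system.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UsageMonitor:
    """Keeps an audit trail of accesses as (day, user, item) entries."""
    trail: list = field(default_factory=list)

    def record_access(self, user, item, day):
        self.trail.append((day, user, item))

    def users_to_notify(self, item, error_found_on):
        """Users who read `item` on or before the day an error was found."""
        return sorted({user for day, user, accessed in self.trail
                       if accessed == item and day <= error_found_on})

    def demand(self):
        """Number of accesses per item, a crude measure of demand."""
        return dict(Counter(item for _, _, item in self.trail))


# Invented accesses; days are simple integers for illustration.
monitor = UsageMonitor()
monitor.record_access("alice", "tract-file", day=1)
monitor.record_access("bob", "tract-file", day=3)
monitor.record_access("carol", "county-file", day=2)
monitor.record_access("dave", "tract-file", day=9)

notify = monitor.users_to_notify("tract-file", error_found_on=5)
usage = monitor.demand()
print(notify, usage)
```

A single trail serves both purposes here; a real system would also have to decide how long to retain it and who may read it, questions the text raises under access control.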

16. Basic Questions, Priorities, and Research Directions

I have raised several questions which I believe to be basic, in order
to identify and understand the challenges that ought to be faced now and
in the next few years. We must recognize that priorities are to be
established and that resources are to be found and allocated. Depending
on the time horizon selected, and the resources that can be expected to be
available over the period, it may be necessary to assign relative priorities
on the basis of likelihood of success, or at the other extreme, on the basis
of likelihood that the projects are of such long duration and high risk that
no other group could be expected to handle them. Therefore, the research
directions could be the selection either of safe efforts with high likelihood
of success or efforts that are the most risky, leaving the safer ones to
others who do not have large staff or other resources. Sometime before
this conference ends I hope that we will stimulate interest in seeking
answers to the questions of who are the users, what should be done, and
with what priorities. Some of these questions can be resolved by the
use of cost/benefit analyses. These can be difficult since they should
take into account social and economic costs and benefits as well as
financial.

17. Reasons to be Optimistic

In spite of the large number of questions that I have proposed,
I am optimistic, because I believe that many of the significant enhance-
ments and developments that have taken place in computing were developed
to meet the needs of statisticians at the Bureau of Census. Therefore,
I believe that if we concentrate on needs and the required statistical
tools, the development of the appropriate software and hardware will
follow. Today it seems easier to consider hardware development.
I believe that if we concentrate on the analytical statistical questions,
the subsequent software development will take place.

I am optimistic because I believe that meaningful research can
only result from having real and practical problems. Again, if one
looks back at the influence of the Bureau of Census on the development
of both statistics and hardware, this was successful because it was related
to real needs and real problems.

I am also optimistic because we see a joint effort between the
Bureau and ASA. This is good, because many of the problems require
people from more than one discipline, especially in the area of determining
how to perform evaluations of software. In this sense, the existence of
the ASA Section on Statistical Computing is another reason for optimism,
as are some of the activities taking place outside the United States.
We are increasingly looking beyond our shores in the area of computing,
as we have in the past with respect to the theory and application
of statistics.

I am optimistic also because of activities such as those planned
for the International Association for Statistical Computing.

I am optimistic because I can see significant contributions
being made by groups outside the United States that can easily influence
the kinds of activities that ought to be taking place within the United
States. Consider, for example, computer-based data editing, such as
that which is going on in the World Fertility Survey through CONCOR,
and in the efforts of Statistics Canada.

Finally, I feel optimistic because of the recognition of the need
to hold such a conference as this one, composed of people prepared to
meet in working groups and devote time and effort to identify what needs
to be done.

ACKNOWLEDGMENTS

This is to express my appreciation to George W. Barclay,
George M. MinJtch, and Leonard Steinberg for their helpful comments
on the initial draft of this paper.

REFERENCES

Merten, Alan (1970), "Some Quantitative Techniques for File Organi-
zation", Ph.D. Thesis, Computer Sciences Department, University
of Wisconsin.

Merten, A.G., and Muller, M.E. (1972), "Variance minimization in
single machine sequencing problems", Management Sci., 18, No. 8,
pp. 518-528.

Milton, R.C. and Nelder, J.A., eds. (1969), Statistical Computation,
Academic Press, New York.

Muller, M.E. (1963), "A foundation for modern tools of management",
1963 Proceedings, International Conference sponsored by the
American Institute of Industrial Engineers, New York, pp. 123-134.

Muller, M.E. (1969), "Statistics and computers in relation to
large data bases", Statistical Computation, Milton, R.C. and
Nelder, J.A., eds., Academic Press, New York, pp. 87-176.

Muller, M.E. (1975), "Portability standards for software",
Computer Science and Statistics: Proceedings of Eighth
Annual Symposium of the Interface, ed. J.W. Frane, U.C.L.A.,
173-176.

Muller, M.E. (1977), "An approach to multidimensional data
array processing by computer", Comm. A.C.M., 20, No. 2,
pp. 63-77.



Organization of Data for Census Users
By Bruce Carmichael, Warren Besore, and Kam Tse
Systems Software Division
U.S. Bureau of the Census



Herman Hollerith, the inventor of the punch-card tabulating machine that
was the forerunner of modern computers, was a Census employee. His invention
was motivated by the volume of data acquired in a decennial census that by 1890
had grown to the extent that current methods were hard-pressed to complete the
processing of one census before the next was begun.

The problem of data volume is still with the Census Bureau and its users.
In the 1970 census, information was collected from some 65 million households.
Twenty per cent of these households completed a long form of the census question-
naire that provided a comprehensive view of their lifestyle. Today the Bureau
is looking increasingly to sophisticated data organization schemes and access
methods to manage this huge volume of data.

This conference was convened to examine the problem of distribution of
Census data: specifically whether the distribution of software for accessing
and processing Census data would make this data more easily accessible and
closer to the needs of users. It is readily understood that if data is dis-
tributed in a manner that requires extensive processing to extract information
in a useable form, its use is restricted to those who possess the facilities
and the funds to afford the processing. This paper looks at some of the new
techniques in data organization that the Bureau is using, and some of the
facilities available commercially, to see if the Bureau's data organization
technology can be extended to service the needs of users.






I. CURRENT PRACTICE

Perhaps a good place to start is to look at the way in which data dis-
tributed by the Bureau is currently organized, and the software available
for accessing it. While the subject of this discussion covers all types of
data distributed by the Bureau, we will cover briefly data derived from the
decennial census as an illustration of the format in which data is or-
ganized for distribution.

Raw data from a decennial census is stored on magnetic tape, grouped
by geographic area. Once the raw data is edited and validated, a set of
basic data tapes is created for use within the Bureau. These tapes com-
prise the basic, or micro, data from which files and special tabu-
lations are derived. Because no disclosure analysis of confidential infor-
mation has been applied, these basic data tapes must remain internal files.

External Data Organization and Use

From the basic data files, various summary files and public use samples
are prepared and disclosure analyses and data suppression are applied on
these files. These files are maintained on tape for distribution to Census
data users. The summary files are grouped into two categories: summary
counts and subject reports. The summary counts for the 1970 census occupy
approximately [illegible] tape reels, the subject reports about 400 reels, and the
public use samples another [illegible]00 reels. These tapes are available in 556, 800,
and 1600 bpi recording densities.

The summary count tapes are ordered by the type of tables contained, level
of geography, and state. Thus if one were interested in the population aged
25-54 living in Suitland and earning over $15,000, this information would be
located in "FILE C" of the "5th COUNT" Maryland summary tapes. Since four
tape reels are required to hold this file, the user would probably have to read
all four reels to locate the desired information.
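The cost of this kind of search can be illustrated with a small Python sketch. It is an illustration only: the four lists stand in for four reels, the area names and table numbers are invented, and the index assumes the file sits on a device that can be read directly by position, which a tape reel cannot.

```python
# Hypothetical layout: one file spread over four reels, each reel a list of
# (area_name, table) records. Names and values are invented for illustration.
REELS = [
    [("Annapolis", 101), ("Baltimore", 102)],
    [("Bethesda", 103), ("Columbia", 104)],
    [("Laurel", 105), ("Rockville", 106)],
    [("Suitland", 107), ("Towson", 108)],
]


def sequential_lookup(reels, area):
    """Read reel after reel until the area is found.

    Returns (table, number_of_records_read)."""
    records_read = 0
    for reel in reels:
        for name, table in reel:
            records_read += 1
            if name == area:
                return table, records_read
    return None, records_read


def build_index(reels):
    """Map each area to its (reel number, position within the reel)."""
    return {name: (reel_no, position)
            for reel_no, reel in enumerate(reels)
            for position, (name, _) in enumerate(reel)}


def indexed_lookup(reels, index, area):
    """Go straight to the record; exactly one record is read."""
    reel_no, position = index[area]
    return reels[reel_no][position][1], 1


index = build_index(REELS)
seq = sequential_lookup(REELS, "Suitland")
direct = indexed_lookup(REELS, index, "Suitland")
print(seq, direct)
```

In the sequential case the number of records read grows with the position of the wanted area in the file; with the index it stays at one, at the price of building and maintaining the index.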

A set of extraction programs called DAULIST is available to assist the
user in locating and displaying tables on the summary tapes. For certain
cases, the extraction programs even had limited aggregation capabilities.
More extensive or detailed extractions had to be performed by custom programs.
Because of the difficulty of developing such programs, processing centers
employed specialized staff for this work.

There are several improvements that could be made in the distribution of
data. Newer tape drives permit higher recording densities, requiring fewer
reels of tape to hold the files. Different tape formats can facilitate pro-
cessing the data. A larger variety of extractions can reduce the amount of
processing required of the users.
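The effect of recording density on reel counts is simple arithmetic, sketched below in Python. The figures are assumptions made for illustration: a hypothetical 400-reel file recorded at 1600 bpi, and a 6250 bpi drive standing in for "newer" equipment. The calculation also ignores inter-record gaps and blocking, so it gives only a rough lower bound.

```python
def reels_at_new_density(reels_now, density_now_bpi, density_new_bpi):
    """Reels needed after re-recording at a new density.

    Assumes equal reel lengths and ignores inter-record gaps, so the
    count scales inversely with density; the result is rounded up."""
    bits_per_inch_total = reels_now * density_now_bpi
    return (bits_per_inch_total + density_new_bpi - 1) // density_new_bpi


# Hypothetical example values, not figures from the 1970 census files.
print(reels_at_new_density(400, 1600, 6250))
print(reels_at_new_density(400, 800, 1600))
```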

On the whole, though, sequential file organization, and sequential pro-
cessing of data, has reached its limit. If we are ever to make any progress
in reducing the cost of processing Census data, we must come up with a new way
of organizing this vast volume of data that will make it manageable. This
organization method should include a common logical model of the data and its
structure and a common method for accessing the data. Furthermore, this model
should be compatible with the Bureau's internal structures. Compatibility
would be beneficial in two ways. Data could be more timely if user-accessible
data would be updated in the same form as internal data rather than having
to be translated. Secondly, it would require fewer resources to extend the
Bureau's software systems than to go through the process of developing a new
system from scratch, and then interfacing it with the Bureau's.






II. COMMERCIALLY AVAILABLE SOFTWARE, HARDWARE, AND SERVICES

A brief survey of commercially available computer products and services
might be appropriate to identify items which could be utilized in distributing
Census data.

Software Packages

There are well over a hundred commercially available packages that are
advertised as Data Base Management Systems (DBMS) and the number is constantly
growing. Packages conforming to the report published in 1970 by the CODASYL
committee are available for the equipment produced by each major computer manu-
facturer. This report is the basis of an industry standard for a Data Manipu-
lation Language (DML) and a Data Definition Language (DDL) for data bases built
on a network model. More recently, with the emerging research by Codd and
others, new DBMS are being developed and tested which present to the user a
relational model of a data base.

Generally speaking, commercially available packages can be classified into
the following categories:

    Data Retrieval Systems
    File Management Systems
    Complex File Systems
    Data Base Management Systems
    Special-Purpose Systems

Due to the volume and complexity of Census data we will limit ourselves to
surveying DBMS only. The following table presents basic information on 11 of
the more popular DBMS. Host language packages provide a DML that is embedded in
a conventional high-level programming language, usually COBOL, FORTRAN, or PL/1.
Translation of the DML is generally implemented through an enhanced compiler,




Host Language CODASYL
Data Base Management Systems

                  DMS 1100              IDMS                  Honeywell IDS-II
CPU               UNIVAC 1100           IBM 370               H6000
Item Description  COBOL Oriented        Host Language Like    COBOL Like
Logical           Network               Network               Network
Physical          Pointers              Pointers              Pointers
Access Methods    Direct, Hashed,       Direct, Hashed,       Direct, Hashed,
                  ISAM, Network         Network               ISAM, Network
D.B. Creation     User Programs         User Programs         Utility
Query Language    Yes                   No                    [illegible]
Report Generator  Yes                   No                    No
Host Language     COBOL, FORTRAN        COBOL, FORTRAN        COBOL
Multi-thread      Yes                   Yes                   Yes
Security          None                  Thru Subschema        Password
Data Validation   None                  None                  Yes
Recovery          Full Scale            Full Scale            Full Scale
Surveillance      Log Tape and          None                  Yes
                  Statistics
                  Collection

Figure 1




Host Language Non-CODASYL
Data Base Management Systems

                  Burroughs         Cincom            IBM               MRI               Software AG
                  DMS II            TOTAL             IMS               System 2000       ADABAS
CPU               Burroughs         IBM 370 **        IBM 370           IBM 370           IBM 370
                                    CDC 6000                            UNIVAC 1100
                                    UNIVAC 70                           CDC 6000
Item Description  Host Language     Host Language     Host Language     Host Language     Host Language
                  Like              Like              Like              Like              Like
Logical           Network           Multi-list        Hierarchy         Tree Structured   Almost
                                                                                          Relational
Physical          Pointer           Pointer           Adjacency         [illegible]       Pointers
Access Methods    Direct, Hashed,   Direct,           Direct,           Direct,           [illegible],
                  ISAM, Bit         Sequential        Sequential,       Inverted          Hashed
                  Vector, Network,                    Inverted          [illegible]
                  [illegible]                         Indices
D.B. Creation     User Programs     User Programs     User Programs     Utility &         Utility &
                                                                        User Programs     User Programs
Query Language    Yes               Yes               Yes               Yes               Yes
Report Generator  Yes               Yes               Yes               Yes               Yes
Host Language     ALGOL, PL/1,      Any lang. with    COBOL, PL/I,      COBOL,            COBOL, FORTRAN,
                  COBOL             subroutine        Assembler         FORTRAN           PL/I, Assembler,
                                    calls                                                 ADASCRIPT
Multithread       Yes               Yes               Yes               Yes               Yes
Security          None              None              Yes               Yes               Yes
Data Validation   Some              None              None              Some              Some
Recovery          Full Scale        Some              [illegible]       Full Scale        Full Scale
Surveillance      Some              None              Some              Log Tapes         Log Tapes

** Also: PDP-11, Honeywell 2000, IBM System/3, NCR Century, Varian V70

Figure 2



SELF-CONTAINED
Data Base Management Systems

                  Computer Corp. of America  Mead Technology            TRW
                  Model 204                  Data/Central               GIM II
CPU               IBM 370                    IBM 370                    IBM 370, UNIVAC 1100,
                                                                        PDP-11
Item Description  Character String           Character String           Character String,
                                                                        Numeric
Logical           Almost Relational          Multi-list                 Almost Relational
Physical          Pointers                   Adjacency                  Pointers
Access Methods    Sequential,                Inverted Indices,          Inverted Indices,
                  Inverted Indices,          Sequential                 Hashed
                  Hashed
D.B. Creation     Utility                    Utility                    Utility
Query Language    Yes                        Yes                        [illegible]
Report Generator  No                         No                         Yes
Host Language     COBOL, FORTRAN,            Any language with          COBOL and Own
                  PL/1, Assembler            subroutine call
Multi-thread      Yes                        [illegible]                Yes
Security          Yes                        Yes                        Yes
Data Validation   Yes                        No                         Yes
Recovery          Some                       Yes                        Yes
Surveillance      Yes                        Yes                        Some

Figure 3



a pre-compiler, or subroutine calls. Many DBMS also offer self-contained
query languages for on-line interactive retrieval and update.

Hardware Systems

In most production environments DBMS is treated like any other pro-
gram sharing the resources of the host computer. Even in those installations
that dedicate a computer to data base applications, the hardware configuration
and operating system software are not modified. It is common for big cor-
porations or government agencies to employ a large- or medium-scale computer
running a DBMS supplied by a software firm or by the computer manufacturer.

In these installations, the data base resides on mass-storage. Numerous
interactive terminals access the computer for on-line updates and instant
information retrievals. Jobs for batch updates and periodic report genera-
tion are either run concurrently with on-line processing or during off-shifts,
depending on the capacity of the hardware.

With the recent proliferation of minicomputers, many firms have come to
possess one or more of them. There are two basic methods of employing minis
for data base applications. One is a stand-alone system. Smaller companies
may own or share only one mini which they use for all their computing require-
ments, including data base.

A second method is a distributed network. Bigger corporations may own
several minis and possibly some large- or medium-scale computers, in geo-
graphically dispersed locations. In addition, they may have a number of data
bases of various sizes, some of which are useful only to a particular branch.
In this instance, a distributed data base network would be more suitable. Each
node of the network would possess a mini to handle its local data base work,






In addition to the traditional approaches, there has been active research
toward the implementation of a so-called data base machine. Some researchers
are considering a hybrid machine in which special processors are added to the
conventional general-purpose computers. For example, one such attempt was to
add an Associated File Processor, implemented on a PDP-11, to perform associative
(parallel) searching of a very large textual data base. Others have suggested
that the architecture of the conventional computer should be changed to accom-
modate the functions provided by the DBMS, especially those connected with the
relational model.

Computer Service Organizations

In the current marketplace, it is unnecessary for an organization to own
or rent a computer in order to have access to diversified computing services,
including data base packages. Many companies are in the business of providing
a computing utility, much in the way the phone company provides a communications
utility. One such service is General Electric's Mark III, which is described
here as an illustration of the kinds of services available. This is not meant
to imply that Mark III is either the best or most comprehensive of such services.

Mark III has thousands of customers on a world-wide network. Many of the
customers have large volumes of data stored on Mark III. Each customer can
access his data base interactively or in a batch mode, using either his own pro-
grams or a generalized software package furnished by G.E.

Local phone numbers are available in all major U.S. cities that allow users
to connect to the Mark III network. Twenty-four hour, toll-free service numbers
are staffed by consultants who will assist a user needing help or encountering
problems.




Generalized software currently available on Mark III includes their own data
base package, DMS II, which interfaces with FORTRAN as well as with specialized
software packages such as plotting routines, report writers, and interactive query
programs. Non-programmers can perform their own statistical manipulation of the
data, such as row and column sums, averages, percentages, and deviations.
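The manipulations named here are easy to state precisely. The Python sketch below computes them for a small cross-tabulation; it is not Mark III or DMS II code, the table values are invented, and "deviations" is taken here to mean deviations of each cell from its row average, which is an assumption about the intended meaning.

```python
from statistics import mean

# Hypothetical cross-tabulation: rows are areas, columns are age groups.
TABLE = {
    "Area 1": [10, 20, 30],
    "Area 2": [60, 80, 100],
}


def summarize(table):
    """Row and column sums, row averages, row percentages, and deviations."""
    row_sums = {name: sum(values) for name, values in table.items()}
    col_sums = [sum(column) for column in zip(*table.values())]
    grand_total = sum(col_sums)
    row_averages = {name: mean(values) for name, values in table.items()}
    return {
        "row_sums": row_sums,
        "col_sums": col_sums,
        "row_averages": row_averages,
        # Each row's share of the grand total, in per cent.
        "row_percentages": {name: 100 * total / grand_total
                            for name, total in row_sums.items()},
        # Deviation of each cell from the average of its own row.
        "deviations": {name: [v - row_averages[name] for v in values]
                       for name, values in table.items()},
    }


summary = summarize(TABLE)
print(summary)
```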

Custom Census Data Processing Services

If a user of Census data requires a more customized form of computer service,
he can turn to one of a number of outside organizations equipped to perform
specialized processing of summary and sample data. Some of these organizations
provide a broad line of services, while others have concentrated on specialized
types of work.

One such organization is DUAL Labs. Again, this description is intended as
an illustration and does not imply endorsement of any organization. DUAL Labs is
a non-profit corporation offering a variety of services. They provide consulting
services and training as well as custom processing of Census data. DUAL Labs does
not have its own computer installation, but instead buys computing services to
support their work. A fair amount of generalized software has been developed by
DUAL Labs, including extraction software for summary data that makes use of a data
dictionary and provides aggregation capability, and software for making and docu-
menting vertical and horizontal cuts of public use samples. This software has
also been sold to users. Some DUAL Labs cooperating offices provide their soft-
ware on a time-sharing basis to users. In fact, DUAL Labs provides the type of
service that many countries offer through their national statistical offices.

» • - . 

> 

V 

Other organizations, such as National Planning Data, provide more specialized
services, such as making ED data available on microfiche, digitizing tract
boundaries, or providing population density or affirmative action
information.



III. PLANNED DEVELOPMENT

The Census Bureau is working toward an integrated system for the collection,
processing, and presentation of Census data. The focal point of this system
will be a data base management system that will provide the structure for and
access to the data.

The Bureau has selected Univac's DMS-1100 for its initial development work.
A data base for administrative data is already operational under this system.



One area to which the Bureau is applying new data organization technology
is disclosure analysis and data suppression. A system for automated
disclosure analysis is being developed for use in the 1977 Economic Census.
This system uses a highly structured and easily accessible geographic lattice
to provide containment and intersection information.
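The paper does not describe the lattice's internal layout. As a minimal
sketch of what "containment and intersection information" could mean, the
following Python fragment assumes a lattice stored as a table of parent
links, in which an area may have more than one parent (a tract inside both a
county and a city). The area names, the PARENTS table, and the function names
are all hypothetical.

```python
# Hypothetical lattice: each area maps to the set of areas that directly contain it.
PARENTS = {
    "tract-101": {"county-A", "city-X"},
    "tract-102": {"county-A"},
    "tract-201": {"county-B", "city-X"},
    "county-A": {"state-S"},
    "county-B": {"state-S"},
    "city-X": {"state-S"},
    "state-S": set(),
}


def ancestors(area):
    """Return every area that contains `area`, directly or indirectly."""
    found = set()
    stack = list(PARENTS[area])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(PARENTS[node])
    return found


def contains(outer, inner):
    """Containment query: does `outer` contain `inner`?"""
    return outer in ancestors(inner)


def intersection(a, b):
    """Intersection query: the areas that lie inside both `a` and `b`."""
    return sorted(x for x in PARENTS if a in ancestors(x) and b in ancestors(x))


print(contains("state-S", "tract-101"))
print(intersection("county-A", "city-X"))
```

Queries of this kind matter for disclosure analysis because a suppressed
figure for one area can sometimes be recovered by subtraction from published
figures for areas that contain or overlap it.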

Another current data base project is the Master Reference File for the 1980
demographic census. This file will be linked to a geographic lattice. The data
base will allow interactive reference and update for such pre-census activity
as mailing counts and boundary and annexation changes, as well as controlling
the activities of enumerators across the country during the census.
Preliminary field counts will be compared with predicted counts in each
geographic area to determine whether they appear reasonable, and counts that
are suspect will be flagged for re-count.
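The comparison of preliminary and predicted counts can be sketched as
follows. This is an illustration only: the 10 percent tolerance, the area
codes, and the function name are assumptions, and the paper does not state
what test of reasonableness the Bureau intends to apply.

```python
def flag_suspect_counts(predicted, preliminary, tolerance=0.10):
    """Return the area codes whose field count differs from the predicted
    count by more than `tolerance`, expressed as a fraction of the prediction."""
    suspect = []
    for area, expected in predicted.items():
        observed = preliminary.get(area)
        if observed is None:
            continue  # no field count has been reported for this area yet
        if expected == 0:
            deviates = observed != 0
        else:
            deviates = abs(observed - expected) / expected > tolerance
        if deviates:
            suspect.append(area)
    return sorted(suspect)


predicted = {"ED-001": 1200, "ED-002": 800, "ED-003": 450}
preliminary = {"ED-001": 1185, "ED-002": 610, "ED-003": 452}
print(flag_suspect_counts(predicted, preliminary))  # ['ED-002']
```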



From these current projects, the Bureau's aim is to develop a good model for
the geographic structure of its data, and to develop an in-place geographic
lattice. GTS-3, the third level of the Bureau's Generalized Tabulation System,
will contain an interface for data base access. GTS-3 will use the data base
both for source data and storage of intermediate results. Data base interfaces
will also be built for graphics and statistical analysis systems.



Data base technology serves two functions at the Bureau. One function is to
integrate data. It provides a structure for data that improves the efficiency
of data access, since needed items may be accessed without passing the whole
file. It separates physical storage from applications logic, providing
flexibility in storage medium and allowing access software to be optimized.
It avoids duplication of data and provides a single control of all data,
allowing rapid distribution and correction of data while avoiding the problems
of consistency encountered when data is kept in many files. Data base
technology also serves to integrate software by providing a common form for
passing data between processing subsystems.



IV. POTENTIAL FOR DEVELOPMENT

External Data Organization and Use

As the technology of new hardware and software systems makes the use of more
highly structured data a possibility for Census data users, it will become
increasingly important to develop a common logical model of Census data, both
at the summary and micro levels, that is compatible with the Bureau's data
organization. A common model will also allow the distribution of
pre-structured data on new mass storage media such as honeycomb cells or
holograms. It will also make it possible to take advantage of new data
organization technology, such as associative accessing, without modifying user
programs.

The addition of a time dimension to Census data is another innovation that
will be possible through the use of new hardware and software technology.
Access to time-series data allows the projection of trends and patterns, but
requires massive amounts of on-line storage and sophisticated retrieval
techniques. The Census Bureau has developed a simple time-series data base
system that is used on small economic data bases. Statistics Canada provides
limited amounts of time-series data through its CANSIM system. In the future,
we will probably see heavy new development in this area.

As the external user is provided with larger masses of data summarized in
time-series form, the problems of disclosure analysis and data suppression
become more difficult. Intersecting disclosure problems in a time-series data
base have received almost no attention so far.

Shared Internal and External Data Use

Most of the data collected by the Census Bureau could be shared with data
users once the disclosure and storage technology problems have been solved. If
this is going to happen, the Census Bureau and its users must work jointly to
come to an agreement on the best logical model of the data to be shared. The
model should be as simple and as free from "computerese" as possible, so that
a statistician or other subject matter analyst can work with it directly.

At the same time, new media and formats must be explored for the storage
representation of data. A strict separation must be maintained between the
logical and physical models so that new technology will be transparent to the
user. Formats should be standardized so that data is easily transported from
one site to another. Both the format and medium of data exchange should
efficiently support the common logical view of the data.

In addition to format and medium, there should be a well-designed common
logical model supported by a compatible data organization. Changes in the
organization should be transparent to the user, and transportability between
machines should be maintained. Although tape is currently the primary medium
for transporting data, and hence sequential organization is predominant, in
the future data may be transported as holograms, floppy disks, or bubble
fields, making alternative data organizations practical.

Shared Internal and External Software

Once a common logical model of Census data is achieved, formats are
standardized, and transportable data organizations are developed, it will be
possible to share data management software that has been specialized to handle
Census data. This shared software will need to be transportable over a variety
of computer hardware. Transportability may be achieved either by producing and
maintaining multiple versions of the software, each implemented for a
particular machine but having an identical user interface, or by producing and
maintaining a single version written in a high-level language for which most
machines have a standard compiler.

With shared use of data management software comes the possibility of
distributing Census data in a pre-loaded data base format. This would
eliminate the duplication of the time-consuming data structuring operations at
every site. In fact, more complex data structures could then be feasible,
since the work involved in producing the structures is done only once. At the
same time, more complex data structures could provide the user with a faster
and more versatile retrieval capability. With proper data structuring, micro
data could easily replace summary level data in many instances, since the cost
of producing special tabulations should become very low.

Shared Internal and External Data Center Use

An easier solution to the problem of sharing data management software and
structured data is through the use of a shared data center. As mentioned in
section III, the capability for multiple users with diverse problems, residing
over a large geographic area, to share a common computer facility is currently
available. It should be pointed out that any such facility could not house
confidential data, and hence could not co-exist with many normal Census Bureau
functions. It could, however, easily be a normal part of the activity of some
time-sharing service. In fact, some 1970 Census data is now available on some
commercial time-sharing services.

Under current disclosure guidelines, it would be fully possible to have the
total 1980 Census summary data files and public use samples available through
a time-sharing service to any and all interested users. More study would be
required to determine the feasibility of placing the entire micro data base
into a time-sharing environment. In order to make such a concept useful, one
would need to be able to do special tabulations cheaply from the micro data
and to insure the non-disclosure of confidential data.



Within the next ten years, hardware and software should be developed to a
point that the entire micro data file and many summary tables could be
maintained on-line. This will make the development of standard statistical
data base packages important. At the same time, certain data users may prefer
to continue to extract portions of the large data base and re-load those
portions into other data base packages. This total operation could be
performed within the context of a single time-sharing environment. The cost of
all services would be paid by the user directly to the time-sharing service.

Such an environment would allow the development of a truly integrated system
of generalized software interfaces to an up-to-date version of the Census data
base. It solves the problem of standardization of hardware/software
configuration. It would allow for the expansion of Census data dissemination
to include new areas such as current population surveys and economic data,
perhaps even in time-series form.




V. DATA BASE IMPACT ON CENSUS DATA USERS - ISSUES OF CONCERN

Disclosure Analysis 



By act of Congress, data about individuals collected in the various censuses
and surveys conducted by the Bureau cannot be disclosed in such a way as to
allow identification of the individuals. However, in certain cases, data
concerning an individual person, farm, or company could be derived from
unedited summaries. For example, if County A has one very small peanut farm
and one very large one, then publication of data on peanut farming in County A
would necessarily disclose much information on the big peanut farm. In order
to protect against this kind of unwarranted disclosure, the Census Bureau
spends a large amount of time and effort in editing the data to be published.
In the past, this was done manually. Analysts and experts on disclosure
examined the data table by table, editing it according to a set of prescribed
rules. Moreover, it was sometimes necessary to modify tables for related
geographic levels to protect against disclosure through inference.
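The peanut farm case can be expressed as a simple cell-level test. The sketch
below is illustrative only: the paper does not give the Bureau's prescribed
rules, so the two tests used here (a minimum number of respondents, and a
limit on the share contributed by the largest respondent) and their threshold
values are assumptions.

```python
def needs_suppression(contributions, min_respondents=3, dominance_share=0.8):
    """Decide whether a published cell total would reveal too much about one
    respondent. `contributions` lists each respondent's value in the cell."""
    if len(contributions) < min_respondents:
        return True  # too few respondents: each can estimate the others
    total = sum(contributions)
    if total > 0 and max(contributions) / total > dominance_share:
        return True  # one respondent dominates the published total
    return False


county_a = [5, 4000]              # one very small and one very large peanut farm
county_b = [300, 350, 280, 410]   # several farms of comparable size

print(needs_suppression(county_a))  # True
print(needs_suppression(county_b))  # False
```

A test of this kind covers only direct disclosure. Disclosure through
inference, mentioned above, also requires examining related tables, since a
suppressed cell can be recovered by subtracting published cells from a
published total.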


In any shared data center, it is essential not only to insure that disclosure
problems do not exist, but also to avoid any appearance that might imply
disclosure of confidential information. For this reason, although it may be
technically feasible to develop software to automatically perform disclosure
analysis and data suppression, it is highly unlikely that outside users would
be allowed to share a data base containing confidential information. The state
of the art in data base security simply is not adequate to justify such a
risk.

Accuracy of Data

One of the primary concerns of data users is the accuracy of their data.
This is a particularly strong aspect of the shared data base environment.
Because of the ease of correction in a data base, as post-tabulation
activities reveal errors in the data, immediate correction to the shared data
base can take place. In the past, correction was generally not performed due
to the magnitude of the job. Instead, errata sheets were published warning
users of various discrepancies whenever possible.
The availability of a shared data base also makes it much easier to perform
inter- and intra-table consistency checks. Not only would such a capability
help the Bureau locate and correct problems, but it would also help data users
convince themselves of the validity of the data.
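As a rough illustration of the two kinds of check, the sketch below treats a
table as a set of rows, each holding its cells and a stated total. An
intra-table check confirms that the cells of every row add up to the stated
total, and an inter-table check confirms that two tables covering the same
population agree on the grand total. The table layout, the figures, and the
function names are hypothetical.

```python
def intra_table_errors(table):
    """Return the labels of rows whose cells do not add up to the stated total."""
    return [label for label, row in table.items()
            if sum(row["cells"]) != row["total"]]


def inter_table_consistent(table_a, table_b):
    """Check that two tables covering the same population share a grand total."""
    total_a = sum(row["total"] for row in table_a.values())
    total_b = sum(row["total"] for row in table_b.values())
    return total_a == total_b


# Population by sex and by age group for the same two counties.
by_sex = {
    "County A": {"cells": [510, 490], "total": 1000},
    "County B": {"cells": [300, 310], "total": 600},   # cells sum to 610
}
by_age = {
    "County A": {"cells": [250, 600, 150], "total": 1000},
    "County B": {"cells": [150, 360, 100], "total": 610},
}

print(intra_table_errors(by_sex))              # ['County B']
print(inter_table_consistent(by_sex, by_age))  # False
```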



Timeliness of Data

A user-accessible data base of Census information can improve the timeliness
of data delivery in three major ways. First, data could be loaded into the
data base as it is processed, eliminating the normal distribution delay and
making the data immediately available. Secondly, as the need for correction of
summary data is discovered, those corrections can actually be made in the data
base, making them immediately available to the users. Thirdly, as the original
Census data ages, new survey information could be made available on a
time-series basis to augment the original data. This could be extremely
valuable to researchers interested in short-term trends and projections.

Cost of Data Delivery

The total cost to the user for delivery of his final data product should be
greatly reduced in a DBMS environment. This is primarily due to the fact that
only the exact quantity and content of information needed to supply the
request must be processed. The data base eliminates repeated traversals of a
large sequential file to extract a limited amount of information. It also
eliminates much of the programmer cost associated with writing and debugging
custom programs for summary tape processing. Finally, there should be a
significant cost reduction simply because of the scale of the operation and
the fact that the processing center focuses directly on the processing of
Census data.
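The saving from avoiding repeated traversals can be illustrated with a small
comparison. In this sketch a Python list stands in for a sequential summary
tape and a dictionary keyed on tract stands in for the data base's access
structure. The record layout and the key are assumptions made for the
example, not a description of any Census file.

```python
def sequential_lookup(records, tract):
    """Scan the file from the start, as with a tape, counting records read."""
    examined = 0
    for rec in records:
        examined += 1
        if rec["tract"] == tract:
            return rec, examined
    return None, examined


def build_index(records):
    """Build a keyed access structure once, so that later requests go
    directly to the wanted record."""
    return {rec["tract"]: rec for rec in records}


records = [{"tract": f"{n:04d}", "population": 1000 + n} for n in range(5000)]
index = build_index(records)

rec_seq, examined = sequential_lookup(records, "4999")
rec_idx = index["4999"]

print(examined)               # 5000 records read to reach the last tract
print(rec_idx["population"])  # 5999, found without reading the other records
```

The index is built once, in a single pass, and every later request benefits
from it. The sequential file repeats the full scan for each request.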

Ease of Use

One of the most important impacts of such a data base would be the easy
availability of the vast amounts of Census data to users who are not
computer-oriented. The user view and interface language of the data base
system could be such that non-programmers would feel at ease in employing it.
In addition, immediate help for such non-programmers could be made available
through both HELP commands on the system and hot-line service from the center.

Adaptability

It would be important to balance the data base carefully so that good service
could be obtained by both the small request from a non-programmer and the
large request from a custom program. In addition, the data base must be
smoothly interfaced to other statistical software packages that provide
aggregation, display, graphics presentation, and computation capabilities.

VI. CLOSING REMARKS

The use of data base organization techniques for Census user data is both
feasible and cost effective. Several different approaches to the problem seem
to be promising. At a most fundamental level, data tapes that are distributed
to users could be reorganized to provide a limited amount of tape-oriented
table indexing and chaining of data based on the structure of the internal
data bases. A more useful approach would be distribution of pre-loaded data
base tapes for a select group of the most popular data base packages. If it
were possible to define a common set of data base software that was machine
independent or easily transportable, the software and pre-loaded data bases
could be distributed together. But the most viable and potentially useful
approach seems to be the availability of a Census user data base on a national
computer time-sharing network. This data base could be maintained by the
Census Bureau and accessed by anyone wishing to make use of the data and able
to pay the access cost.

If we are to pursue any of these possibilities, we need to make a decision
now. Future cooperative efforts will affect the Bureau's development strategy,
as well as the strategy of users' development. It will also be necessary to
allocate resources to provide for future development.



GENERALIZED STATISTICAL TABULATION

By Hugh P. Brophy, U.N. Statistical Office, New York, N.Y.

Introduction

The general subject of access to census data includes the regular
programme of publication, the provision of summary tapes and software
for using them, and the production of ad hoc tabulations. In all cases,
the task of statistical tabulation is directly or indirectly involved.
Small wonder then that it is a topic receiving special emphasis in this
Conference.

As one who became involved in the implementation of a generalised
statistical table generator in the mid-sixties, and who considered proudly
that the system produced then solved all the interesting problems, it is
sobering to be involved in a Conference in 1977 that is discussing the
feasibility of a project aimed at the very same software task. But, wiser
now, I recognize that my efforts and those of many others have fallen short
of anything approaching an ideal system, and this discussion is thus highly
appropriate. I note that the discussion takes place in the framework of
improving access to census data and I intend to treat that as an overriding
consideration.

The Task

The task of statistical tabulation is, on the face of it, a rather
mundane programming exercise - one which trainees solve fairly easily, at
least for straightforward cases, early in their careers. What is involved
is essentially a mapping, normally many-to-one, from the records in the
input file to those in the output file. The output file is generally a
series of n-dimension matrices with textual definitions and descriptors
attached. That sounds simple enough. But, as those who have worked in
official statistics know, the range of problems involved in defining the
input, selecting appropriate records and items, and manipulating and
formatting the output required for a national census presents a formidable
task.
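The many-to-one mapping described above can be shown in a few lines for the
straightforward case. In this sketch each input record is mapped to one cell
of a table, identified by the record's values on the chosen dimensions, and
the cell is incremented. The record fields and the function name are
hypothetical, and none of the difficulties listed above (record selection,
recoding, formatting) are addressed.

```python
from collections import Counter


def tabulate(records, dimensions):
    """Map each record to the cell named by its values on `dimensions` and
    count the records falling in each cell (a many-to-one mapping)."""
    table = Counter()
    for record in records:
        cell = tuple(record[d] for d in dimensions)
        table[cell] += 1
    return table


records = [
    {"sex": "F", "age_group": "0-17", "tenure": "owner"},
    {"sex": "M", "age_group": "18-64", "tenure": "renter"},
    {"sex": "F", "age_group": "18-64", "tenure": "owner"},
    {"sex": "F", "age_group": "18-64", "tenure": "renter"},
]

table = tabulate(records, ("sex", "age_group"))
print(table[("F", "18-64")])  # 2
```

Adding a dimension name to the `dimensions` argument yields a table of
higher dimension from the same pass over the records.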

During the sixties, many organizations independently undertook, with
varying degrees of success, to produce a generalised solution to the
problem. The major difficulties to overcome were those presented by:

. core restrictions
. complexity
. the size of the input file
. the need for machine efficiency

The solutions proliferated in national statistical offices and
other organizations. In the case of the Census and Statistics Bureau
in Australia 1/, a generalised table generator was first used in
producing 1966 census results, but quickly was applied to many other
fields of statistics. It had a dramatic impact on processing. Previously,
40% of CPU time was consumed by sorting. With the advent of the generator,
this dropped to less than 10%. Similar results were experienced by other
national statistical offices. When the UK Statistical Office decided to
launch yet another effort in the early seventies, they began by taking an
inventory of existing "generalised table generators". They stopped when
the number had passed 100.

1/ "A Generalised Table Generator", L. Ion, Proceedings of the Fourth
Australian Computer Conference, Adelaide, Australia, 1969.

Many of these systems, as well as solving most of the problems above,
met most of the desirable system objectives, in that they involved a
user-oriented language, they were capable of producing many tables in one
pass of a large file (which could be random-order) and they enabled the
production of tables in a limited time from date of specification. The
problem was solved many times over.

However, when one looks today for a generalised table generator for
a non-trivial tabulation task, one would have reason to be disappointed
with the systems available. With each system evaluated, one would find
one or more of the following problems:

Size Restrictions: Many table generators are incapable of producing in a
single pass more than, say, 100,000 cells. Some produce two-dimensional
tables only, some have severe limitations imposed by page size, others
limit any dimension to, say, 100 values, and so on. Whilst these
limitations are acceptable in many if not most commercial applications,
they are severely limiting in processing official census results.

Complex Language: The claims for systems of an "English-like" user
language are often ludicrous, the language being instead a cryptic
distorted algebra developed without regard to rigorous syntax or
natural semantics.

Machine Inefficiencies: One of the objectives of a generalised package
is that it should be at least as efficient in producing a given table
as a program developed in a compiler language such as Fortran or Cobol.
Unfortunately, some generalised systems fall short of this objective
by an order of magnitude. (It is interesting to note, in fact, the
incredible range of CPU times consumed in different systems doing the
same job on the same computer system.)

Lack of Portability: Almost all table generators have been designed
without regard to portability and are dependent on certain models
of central computer, specific operating systems or compilers,
certain device types, etc. A potential user can thus face the
impossibly difficult task of redeveloping for his own machine or
start looking for an alternative.

In addition to these problems, there is a variety of limitations that
may hamper the attempt to use a generalised table generator in meeting the
tabulation needs of a project. There are often restrictions on conditional
manipulations, calculation of sub-totals, percentages, handling of
floating-point, footnotes, treatment of "negligible" cells and many
other processes which are traditional in official statistical
tabulations.

The result is that one is required to complement the use of one
or more generalised packages with ad hoc programs for pre-processing
data files, post-processing print files and sometimes even for
performing the tabulation task itself for some tables.

The purpose of this paper is to describe the necessary and desirable
features of a "complete" solution and to examine the feasibility of a
project aimed at an "ideal" system. There will, of course, always be
some special tabulation requirements lying outside the realm of
possibilities of a generalised system, thus making words like "complete"
inappropriate, but at least the elimination of the major restrictions listed
above should be a design objective.

It is not my intent to perform a comparative evaluation of existing
systems. Such evaluations are fundamentally affected by the choice of
criteria and weights, and are often biassed towards an author's own system.
(However, a fairly objective and carefully circumscribed evaluation is
given by Francis et al. 2/).

An "Ideal" System

It has been stated by some people that it is impossible to implement
an ideal system that will meet all the design goals one might have for a
single generalised generator of statistical tabulations. A short list of
the major goals would be:

. ease of use.
. machine efficiency.
. applicability to a wide variety of tabulations from simple to complex.
. capable of running on small configurations but taking advantage of bigger
  resources if they are available.
. producing "camera-ready" printouts with extensive formatting options.
. extensive data manipulation facilities.
. portability.

2/ "Languages and Programs for Tabulating Data from Surveys", Ivor Francis,
Stephen P. Sherman and Richard M. Heiberger, Proceedings of Computer Science
and Statistics: Ninth Annual Symposium on the Interface, 1976.

With the possible exception of portability, I am of the opinion that
sufficient expertise and knowledge of the necessary techniques exist for
the implementation of a single system meeting all these objectives. The
design of such a system would have, inter alia, the following characteristics:

. a true compiler rather than a table-driven program, for the sake of
  flexibility and machine efficiency.

. three major modules - generation of raw tables, manipulation of tables
  and table print - but capable of use as a single system.

. separate definition of data structure, content and descriptors (as in
  the TPL CODEBOOK approach 3/).

. generation and processing of tagged cell data, rather than in-core tallying
  of sub-tables - both for generation of raw cells and their manipulation -
  again for the sake of machine efficiency.

. implementation in a high-level programming language.

. a simple but powerful user language with rigorous syntax. By simplicity
  is meant that the language should be easily learned to a basic level,
  easy to use and to extend one's comprehension. A special feature of
  the language should be its power. By power, I mean the amount of work
  one can define in a given unit of the language, not the sum of all work
  one can define with the language.
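The paper does not spell out what "tagged cell data" involves. One possible
reading, sketched below under that assumption, is that each input record
emits a small tagged item for every requested table, the tag naming the table
and the cell, and that the tagged items are then sorted and combined instead
of holding every sub-table in core at once. The table specifications, field
names, and function names are hypothetical. The in-memory sort used here
stands in for the external sort a production system would need.

```python
from itertools import groupby


def emit_tagged_cells(records, table_specs):
    """One pass over the input: for every record and every requested table,
    emit ((table_id, cell_coordinates), 1)."""
    for record in records:
        for table_id, dims in table_specs.items():
            yield (table_id, tuple(record[d] for d in dims)), 1


def combine(tagged_cells):
    """Sort the tagged items and add up the values that share a tag."""
    ordered = sorted(tagged_cells, key=lambda item: item[0])
    for tag, group in groupby(ordered, key=lambda item: item[0]):
        yield tag, sum(value for _, value in group)


records = [
    {"sex": "F", "age": "18-64", "county": "A"},
    {"sex": "M", "age": "18-64", "county": "A"},
    {"sex": "F", "age": "65+", "county": "B"},
]
# Two tables requested from the same pass over the file.
specs = {"T1": ("sex",), "T2": ("county", "age")}

cells = dict(combine(emit_tagged_cells(records, specs)))
print(cells[("T1", ("F",))])          # 2
print(cells[("T2", ("A", "18-64"))])  # 2
```

Under this reading, the number of tables and cells is limited by sort
capacity and not by core size, which bears on the size restrictions
discussed earlier in the paper.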

It is worth reflecting here that machine efficiency must remain a
primary objective in statistical tabulation. When we are dealing with
the scale of data files and size of tabulation involved in census data
processing, machine inefficiency can render an otherwise useful package
impractical.

Other Facilities 

There are three additional facilities which would make a generalised
statistical table generator even more useful, especially from the viewpoint
of improving access to census results. These are:

. capability to produce photo-composable output. The output destined for
  the printer can be saved on a file which could be input to a generalised
  utility to produce a driver tape for the more commonly available
  photo-composition devices. Relatively generalised software for this purpose
  is being developed in the UN Statistical Office.

. capability to generate large multi-dimension tables on disk for
  scanning or "browsing" through an on-line terminal. Such a system
  was developed in the Australian Bureau of Statistics for Foreign
  Trade statistics. This avoids the printing of such large tables,
  for which the only purpose is availability for such occasional
  browsing.

. ability to link with other files. A common requirement of the users
  of census data is to link with the users' own data for research and
  analysis. Most existing generators accept either a single file only
  or at best files with identical format and content.

3/ "Table Producing Language - Version 3.5 - User's Guide", July 1975,
Bureau of Labor Statistics, Washington, D.C.

Summary

Over the last decade, there has been considerable investment of
time, money and human ingenuity in the development of statistical table
generators. They have had differing sets of design goals and varying
degrees of success in meeting them. For a typical project, the user
tends to receive rather different tables than he would prefer.



There have been some attempts at international cooperation in the
field of the design of software for processing official statistics. A
table generator has always been a subject of primary concern. In the
Working Group on EDP of the Conference of European Statisticians, such
discussions in the mid-sixties led to the establishment of a UNDP
project in Bratislava, Czechoslovakia in 1969. This extensive (seven-year)
project was very successful as a development project and for stimulating
discussion and exchange of ideas on the general subject of official
statistical information systems, with computer processing as a major
element. The table generator developed in this project was, however,
no better than some developed in national offices. Nevertheless, there
was a very telling demonstration of portability. The system was written
in PASCAL for the Control Data 3300. At a meeting of the above-mentioned
Working Group in Geneva in 197?, the system was re-compiled on the IBM
370/15? (a machine with quite a different architecture) and tested and
demonstrated within a week. To date, however, a PASCAL compiler exists
for only a few machines - but to a certain extent the feasibility of
portability of generalised software was established.

The most likely way to develop generally-useful software, it has
seemed to me for some time, would be to fund a project with international
input, but located in a national statistical office of an advanced country.
The objectives of this Conference are thus of great importance. For the
task of statistical tabulation in particular, I am confident that a team
of people experienced in statistical data processing could in a matter of
a few years meet the needs for appropriate user-oriented software. Such
software would greatly enhance the value of census data, thus multiplying
the returns to the considerable investment made in collecting the data.






Acknowledgement: The author wishes to thank the Director of
the UN Statistical Office, Mr. S. A. Goldberg, for his support and
interest in this paper and his colleagues Messrs. H. Lackner and
P. Emerson for their helpful comments and suggestions.



GENERALIZED TABULATING SYSTEMS AT THE U.S. CENSUS BUREAU

By: Melroy Quasney, Chief, Generalized Software Development
Branch, Systems Software Division, U.S. Bureau of the Census

History of Computer Language Development:



The development of generalized tabulation systems at the Census Bureau
has followed the normal development patterns of all problem-oriented
software systems. It is necessary to reflect on the history of computer
languages to set the stage to understand the technological advancements that
permitted the development of problem-oriented software.

Stated in simple terms, any new idea has to overcome two major problems
if the idea is to be implemented successfully. One, the technology must
be developed, tested, and proven possible. Two, the end product must
be accepted by the intended users of the product. These problems also
apply to the development of computer languages.

We began the computer revolution with assembler language; it did not take
long to realize that assembler languages were inhuman to the users of the
computer. Then came Fortran, followed by COBOL and other higher level
languages, all making the computer easier to use to accomplish a given task.
All of these advancements encountered the two problems previously
mentioned. All of these advancements made the job of computer professionals
easier, even though the acceptance of this new technology took time. Other
support software systems were developed to assist the computer
professionals to accomplish their task; however, task complexity also increased.

We are now at the point where the demand for bringing the computer to
non-computer professionals is upon us. This demand is leading to the
development of problem-oriented computer languages. These systems call for a
computer language that addresses a given problem and permits the user
to communicate his request in his language. Probably the single biggest
technological advancement that has permitted the demand for and the
development of problem-oriented systems has been access to the computer
via telecommunications. This has permitted the computer user to interact
with computer software systems, or submit work from remote
stations and receive the results back at the remote site. Generalized
Statistical Tabulation Systems were probably one of the first attempts to
produce problem-oriented software systems. Systems like SPSS, CASPER,
CENTS/COCENTS, TPL, and others all had as their main design objective
to bring the computer closer to the end user. All of these systems
contributed to the advancement of the state of the art for permitting
non-computer professionals, as well as computer professionals, to use the
computer to produce statistical tabulations.



History of Computer Language Development at the Census Bureau 



The Census Bureau's use of computer languages has paralleled the
development and use of computer languages generally; sometimes we have been
up with the front of the pack, and other times we have been slow in taking
advantage of the latest technology. We use very little assembler language for
our production data processing requirements. Most production processing
is done using Fortran; however, Algol and COBOL are beginning to
be used for a large amount of the production processing. A more favorable
point is that most of our generalized software being developed is using Algol
and COBOL.



A prerequisite for acceptance of all generalized software is to develop
problem-oriented user languages that permit the users to state their
request in the language most familiar to them.

Two projects that began fairly close together in time brought the Bureau
into the world of generalized tabulating systems. One was the system known as
GENER70, which began in the late sixties and still has some limited use in
the Bureau. The other project involved the Census Bureau producing a
generalized tabulation system for the Department of State's Agency for
International Development (AID) to be used by developing countries to tabulate
censuses and surveys. This project produced the CENTS/COCENTS system.

The CENTS/COCENTS project produced a product that has been installed in
over 43 countries and in over 68 computer installations, and people from 80
different countries have been trained in its use. The system can operate on
any IBM 360/370 machine, plus 12 other types of mainframe. It has been used
to tabulate major censuses and surveys by computer programmers and subject
specialists.

My reason for emphasizing the CENTS/COCENTS project
is to demonstrate our experience in distributing and supporting software.
We know the level of resources needed and the problems that come with
distributing software.






The main objective of this project was to produce a product that could do
censuses and surveys on small computers and be programmed by both
programmers and subject specialists. These objectives forced the
creation of a system that was efficient, but also produced a product that
received heavy criticism due to its user language being very primitive.



A Complete Generalized Statistical System : 

A complete generalized statistical system must be able to control the
collection of data, perform editing and imputation of the data, build a data base,
tabulate the data, perform statistical analysis of the data, and finally publish
the data in various forms.

Currently at the Census Bureau the Systems Software Division is designing
and beginning the implementation of a complete generalized statistical system.
It is our objective to produce a system that will serve computer professionals
and also put the power of the computer into the hands of the subject specialists.

The planned system consists of six major components: 1) Edit/Imputation
System, 2) Data Base Management System, 3) Tabulation System, 4) Math/Stat
System, 5) Graphics System, and 6) Photo-Composition System. We are
currently working on the Tabulation System, the Graphics System, and the
Photo-Composition System; the Data Base System is being used for some
projects and will be connected with the other modules now being worked
on during 1978.



Some Problems in Implementing a Generalized Statistical System :

As previously stated, two major problems face us in completing our total
system.

We still have considerable technical problems to overcome before the
system is completed. The biggest problem is the details of communication
between the components. We are designing the components to be independent
units, but when the data base is introduced it will be used as the primary
connection between the components. Additional control information will also
have to be passed between the individual systems.

Other technical problems are the range of requirements the system must
satisfy, the various sizes of the data files it must process, and the
implementation of the latest hardware technology to process large files on-line.
An ideal statistical system at the Bureau must be all things to all people,
but simple to use.

The second problem is user acceptance; we need the user community to
accept the individual components and to supply additional specifications
to ensure that the system can satisfy all of the demands of the user in its
future releases. However, introducing new technology is not easy. Changes
to the daily working environment of a staff can be a hard thing to bring about;
proving to a staff that a new product can do a job better takes time.



Census Bureau's Generalized Tabulating System (GTS) :

The Systems Software Division of the Bureau has completed the first version
of a tabulating system known as the Generalized Tabulation System (GTS). It is
important that we explain "why build another tabulation system?"

Before making the decision to build a tabulating system for the Bureau, we
evaluated most existing systems and tried to identify the pros and cons of
each system. We then evaluated the minimum requirements for a first
release for use by the Bureau.

None of the existing tabulation systems evaluated could solve the wide range of
the Bureau's tabulation requirements. None, at the time, were operational
on Univac equipment. But most of all, tests showed that the basic tabulation
strategy of the CENTS/COCENTS system was more efficient. It was then
decided to build our own system using these proven efficient methods, but
to also place major emphasis on producing a user language that is consistent
with the terminology and method of operation used in the Bureau and that
makes it easy for computer professionals and subject specialists to specify
their tabulation requirements to the system.

The Least Common Denominator Approach (LCD) :

The LCD approach permits the user to specify the smallest geographic level
for which a table is to be displayed. Several tables can be tabulated at the
same time, each with a different level of LCD being specified. This approach
permits the minimum amount of hardware resources to be allocated during
the long computer runs that require the examination of millions of detail data
records. This approach also permits hundreds of tables to be produced
with one pass of the detail data.

After the largest part of the processing has been completed, GTS then
uses the LCD blocks to build all higher levels required for display.
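
The LCD idea described above can be illustrated with a minimal sketch. It is not GTS code: the three-level geography (state, county, tract), the record fields, and the function names are all assumptions made up for the illustration. Each table is tallied once at its own LCD level during a single pass of the detail records, and higher geographic levels are then built by summing the LCD blocks rather than re-reading the data.

```python
from collections import defaultdict

# Hypothetical geographic hierarchy, highest level first (an assumption).
LEVELS = ("state", "county", "tract")

def tally_one_pass(records, tables):
    """Tally every table in one pass, keeping cells only at each table's LCD level."""
    tallies = {name: defaultdict(int) for name in tables}
    for rec in records:
        for name, spec in tables.items():
            depth = LEVELS.index(spec["lcd"]) + 1
            geo_key = tuple(rec[level] for level in LEVELS[:depth])
            tallies[name][(geo_key, spec["cell"](rec))] += 1
    return tallies

def roll_up(lcd_tally, depth):
    """Build a higher geographic level by summing LCD blocks; no detail data is re-read."""
    higher = defaultdict(int)
    for (geo_key, cell), count in lcd_tally.items():
        higher[(geo_key[:depth], cell)] += count
    return higher

records = [
    {"state": "24", "county": "031", "tract": "7001", "tenure": "owner"},
    {"state": "24", "county": "031", "tract": "7002", "tenure": "renter"},
    {"state": "24", "county": "033", "tract": "8001", "tenure": "owner"},
    {"state": "51", "county": "013", "tract": "1001", "tenure": "owner"},
]
tables = {
    "tenure_by_tract": {"lcd": "tract", "cell": lambda r: r["tenure"]},
    "tenure_by_county": {"lcd": "county", "cell": lambda r: r["tenure"]},
}

tallies = tally_one_pass(records, tables)
state_totals = roll_up(tallies["tenure_by_tract"], depth=1)
print(dict(state_totals))
```

In this sketch the county-level table never stores tract-level cells, which is the sense in which a coarser LCD reduces the storage a table needs during the long pass.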


A cost comparison was done by DUAL Labs and demonstrates the efficiencies
of the LCD approach. A file containing 17,958 records was used to
tabulate a table containing 56 rows by 2 columns. DPS, Data-text, Nucros,
SPSS, and CENTS were the packages selected for the test. CENTS produced
the table in 18.70 cpu seconds at a cost of $3.92. The next closest system
was SPSS using 42.94 cpu seconds at a cost of $16.86. The most expensive
system was Data-text at 107.62 cpu seconds and a cost of $41.96. DUAL then
took SPSS and CENTS for additional testing. Two file sizes were selected
for the test: 180,047 and 1,799,888 records. When tabulating 180,047 records,
SPSS used 459.46 cpu seconds and cost $89.38; CENTS used 92.36 cpu seconds
and cost $19.00. Based on this test, only CENTS was chosen to tabulate the
file with 1,799,888 records. It took 815.32 cpu seconds and cost $150.00
for CENTS to do the requested task.

BLS, using their TPL system, tabulated a file with 20,196 detail records and
produced the same table that was used in the DUAL test. It took TPL 40.46
cpu seconds, as compared to CENTS tabulating 17,958 detail records and
using 18.70 cpu seconds.
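
Because the test files differ in size, the figures quoted above are easier to compare once normalized. The short sketch below does only that arithmetic, using the record counts and CPU seconds exactly as quoted; the normalization to CPU seconds per 1,000 records is our own choice of yardstick, not one used in the DUAL Labs or BLS reports.

```python
# (system, detail records, CPU seconds) as quoted in the text above.
runs = [
    ("CENTS", 17_958, 18.70),
    ("SPSS", 17_958, 42.94),
    ("Data-text", 17_958, 107.62),
    ("TPL", 20_196, 40.46),
    ("SPSS", 180_047, 459.46),
    ("CENTS", 180_047, 92.36),
    ("CENTS", 1_799_888, 815.32),
]

def cpu_seconds_per_thousand(records, cpu_seconds):
    """Normalize a run to CPU seconds per 1,000 detail records."""
    return cpu_seconds / records * 1000

for system, records, cpu in runs:
    rate = cpu_seconds_per_thousand(records, cpu)
    print(f"{system:10s} {records:>9,d} records  {rate:6.3f} CPU s per 1,000 records")

# Ratio of SPSS to CENTS CPU time on the two files both systems processed.
spss_vs_cents_small = 42.94 / 18.70
spss_vs_cents_large = 459.46 / 92.36
print(f"SPSS/CENTS: {spss_vs_cents_small:.2f} (17,958 records), "
      f"{spss_vs_cents_large:.2f} (180,047 records)")
```

On these figures SPSS used roughly 2.3 times the CPU time of CENTS on the small file and roughly 5 times on the 180,047-record file, and the CENTS cost per thousand records fell as the file grew.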

This kind of efficiency must not be ignored when building a tabulation system
that will be used to tabulate millions and millions of detail records for the
Census Bureau. This method of processing is also compatible with getting the
tally matrices under the control of a DBMS.

The Bureau also capitalized on its available resources; it had
the staff who built the CENTS/COCENTS system available to work on
building an efficient system for the Bureau.

The first version of GTS has been completed, and attached are some test
results to show that we have again built a system that is efficient to use.
The users of the system who produced these tabulations were subject matter
specialists. The total project was completed in one-fourth the time
conventional processing methods would have taken.

Overall GTS Design Requirements :

Five major objectives were selected to act as a guiding force for the
development of the GTS system.

1. Bridge the conflict between being easy to use and powerful.

2. Function in a conversational as well as a batch mode.

3. Exploit the availability of large core storage on the UNIVAC 1100.

4. Maintain consistency in recoding of the input data.

5. Maintain flexibility without loss of machine efficiency.

Evaluation of other table generator systems was performed and some features
of these systems were incorporated into GTS. A continuing effort to keep
track of other systems will be made.

Evaluation of data dictionary concepts has been done; the first version of
GTS uses a stand-alone data dictionary processor. The design of the
dictionary language allows for the future connection into Univac's DMS-1100
data base management system.



GTS will be implemented in phases of capabilities. GTS-1 is now complete 

and GTS-2 is beginning the detail design phase. 


The GTS System :

Attached is a system overview of the GTS system. GTS is designed to
consist of three major segments. They are: 1) the User Processors; 2) the
Execute Processors; and 3) the Display Processors. The User Processor
is the only part of the system addressed by the users of the system. This
provides us with the flexibility to design different user languages; as
long as these different languages follow the rules for passing control
information to the Execute and the Display Processors, several user views of
the system are possible. The Execute and Display Processors are designed
with efficiency and simplicity as the main design goals. Any decisions
that can be made by the Language Processors are made by them.
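
The three-segment split described above can be sketched as follows. This is an illustration only: the two toy user languages, the shape of the control information (a small dictionary), and every function name are assumptions, not the GTS interfaces. The point it tries to show is that two different user views can feed one unchanged execute step, provided both emit the same control information.

```python
from collections import Counter

def keyword_user_processor(request):
    """Hypothetical user language 1: a one-line request such as 'TABLE sex BY tenure'."""
    words = request.split()
    if len(words) != 4 or words[0].upper() != "TABLE" or words[2].upper() != "BY":
        raise ValueError(f"cannot parse request: {request!r}")
    return {"row": words[1], "col": words[3]}

def form_user_processor(form):
    """Hypothetical user language 2: a filled-in form naming the stub and heading."""
    return {"row": form["stub"], "col": form["heading"]}

def execute_processor(control, records):
    """Tally only; every decision was already made by the user processor."""
    return Counter((rec[control["row"]], rec[control["col"]]) for rec in records)

def display_processor(control, tally):
    """Format the tally matrix as a tab-separated table."""
    rows = sorted({r for r, _ in tally})
    cols = sorted({c for _, c in tally})
    lines = [f"{control['row']} by {control['col']}", "\t".join([""] + cols)]
    for r in rows:
        lines.append("\t".join([r] + [str(tally[(r, c)]) for c in cols]))
    return "\n".join(lines)

records = [
    {"sex": "F", "tenure": "owner"},
    {"sex": "M", "tenure": "renter"},
    {"sex": "F", "tenure": "owner"},
]
control_a = keyword_user_processor("TABLE sex BY tenure")
control_b = form_user_processor({"stub": "sex", "heading": "tenure"})
tally = execute_processor(control_a, records)
print(display_processor(control_a, tally))
```

Keeping the execute step free of parsing and validation is one way to read the remark that any decision the language processors can make is made by them.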

GTS-1 Design Objectives and Status : 

The main criticism of CENTS/COCENTS was that the user language was
too primitive and resembled a form of assembler language. When designing
a table generator to run on a computer with 25K of working core and a
CPU that is slow as molasses on a cold day, major emphasis was placed
on efficiency of running and on flexibility to produce publication output.
The price was in the user interface. It should be obvious then that one of
the main design objectives of GTS-1 was to produce a good Census Bureau
compatible user language.

The second objective was to begin, and experiment with, a data dictionary
to describe and control input to the system.

It was decided to use a computer language that would be as portable as
possible to permit the Bureau to change hardware and software with
minimal impact on GTS. A by-product of this decision is that the first two
versions of GTS are usable by other computer installations with a
minimum of resources to adapt the system to a different environment. Using
a higher level language also has advantages in the implementation and
debugging of this and future versions of GTS. Unique hardware and
software features of the Univac 1100 series systems were purposely not used
in the first version of GTS. We wanted to maintain hardware/software
independence so that converting GTS-1 to other computer systems would
be an easy task.

The technical specifications of the system were distributed to the entire
Bureau user community for comment. This was successful in that several
critical design changes were incorporated during the implementation phase.
Test projects using the system also resulted in design changes that were
incorporated in the first version of the system.

It was of course necessary to maintain, or improve, the efficiency achieved
with the CENTS/COCENTS system. The system demonstrated that our
basic design strategies were efficient during the 1974 Ag
Census Volume II test project.

The last objective to be discussed is the requirement that GTS must be
capable of utilizing a Checkpoint/Restart facility. The attachment showing
examples of the cost of some runs of the 1974 Ag Census Volume II project
points out the reason for this to be mandatory to GTS. These large
production runs were on the computer system 6 to 27 wall clock hours. The Bureau's
computer systems are only averaging 12 hours mean time between system
crashes. In this environment GTS must perform restart recovery.
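
With runs of up to 27 wall clock hours and a crash expected about every 12 hours, a long run that cannot resume may never finish. The sketch below shows the general checkpoint/restart idea only; it is not the Univac facility GTS used. The JSON checkpoint file, the `every` interval, and the `fail_at` parameter (which simulates a crash) are all assumptions introduced for the illustration.

```python
import json
import os
import tempfile

def tabulate_with_checkpoints(records, key, checkpoint_path, every=2, fail_at=None):
    """Tally records by `key`, resuming from checkpoint_path if a checkpoint exists."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"next_record": 0, "tally": {}}
    for i in range(state["next_record"], len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated system crash")
        value = records[i][key]
        state["tally"][value] = state["tally"].get(value, 0) + 1
        state["next_record"] = i + 1
        if state["next_record"] % every == 0:
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            # Swap the file in whole so a crash never leaves a half-written checkpoint.
            os.replace(tmp, checkpoint_path)
    return state["tally"]

records = [{"crop": c} for c in ["corn", "wheat", "corn", "soy", "corn"]]
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "tally.ckpt")
    try:
        tabulate_with_checkpoints(records, "crop", path, every=2, fail_at=3)
    except RuntimeError:
        pass  # the run died after its last checkpoint
    resumed = tabulate_with_checkpoints(records, "crop", path, every=2)
print(resumed)
```

The restarted run repeats only the work done since the last checkpoint, and because the tally and the record position are saved together, no record is counted twice.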

All of the above objectives have been met in GTS-1. The first level of
the system was completed in May 1977. Enhancements and error
corrections have been made and the final GTS-1 was completed in October 1977.
Final user documentation was completed in October 1977; training
workshops will begin in December 1977.

GTS-2 Design Objectives and Status :

Major emphasis in GTS-2 will be placed on the data dictionary capabilities
of the system. This effort will try to address the
following problems:

A. Ability to store recode commands.

B. Ability to store headings and stubs connected to related
stored recode commands.

C. Ability to store calculations.

D. Additional automatic documentation of data in dictionary.

E. Recode scale checking to validate recode commands.

F. Validation of a data file against the dictionary describing
the data file.

G. Access to build and use dictionary from a conversational
mode.



Other major design enhancements to GTS-2 will include:

A. Conversational capability.

B. Ability to process overlapping geographic areas in one
pass of data.

C. Expanded statistical capabilities.

D. Improved method to process economic data when displaying
data greater than four positions of the Standard Industry Code (SIC).

E. Random retrieval of geographic and SIC stub descriptors.

F. Dynamic allocation of core and I/O paging to accomplish
current task.

G. Provide linkage to user programmers.

H. Capture information for Math/Stat package.

I. Begin connection to Graphics and Photo-Comp software.

The design phase of GTS-2 began in November 1977 and will be completed
by January 1978. Implementation of GTS-2 is targeted for May 1978.



GTS-3 Design Goals : 

GTS-3 will concentrate on connecting into the database management
system. This will require GTS to use the DBMS's data dictionary and
access data through the DBMS.



Distribution of Tabulation Software :

The CENTS/COCENTS project has given the Census Bureau considerable
experience with the problems of distributing table generator software.
As previously stated, the CENTS/COCENTS system could run on any
IBM 360/370 hardware and DOS, OS-MFT, MFT, and VS operating systems.
It could also run on 12 other types of hardware with their associated
software systems.

Experience has taught us that the only way software can be distributed
successfully is to actually test the software on the target system. This involves
buying computer time and supporting a staff in the field to install and check
out the software. If this is successful, the software must then be packaged
to be as self-installing as possible. This process also requires testing to be
done on the target system.

Experience also taught us that two types of training are required. A
computer professional must be trained and made responsible for supporting the
system at each installation where the system is installed. The second type
of training involves training the intended users of the software system. The
Bureau found that the best way of accomplishing this was to send
technicians to the installation to install the system and do the necessary
training.

Another big problem with distributing support software is the multiple types
of documentation. The basic documentation for using the system is the
same. However, additional support documentation was always necessary for
each unique environment for which the system was supported.

The last problem was testing new versions of the system in all of the
environments for which it is supported. This requires repeating the
process of testing the system in all of the environments supported. It
also involves changing all affected documentation. The last phase of
this process requires the distribution of the new software and
documentation to all computer installations where the system was previously
installed. In some cases this could mean that retraining must be done.

This total effort requires tremendous manpower and computer resources;
this can be translated to a great deal of money. These resources must be
allocated to the organization where the software is being developed and in
each installation where the system is installed and being used.



If the Bureau is to consider distributing GTS along with Census Bureau
data files, the problem becomes more complicated when GTS-3 connects
to DMS-1100. This forces the Bureau to keep a subset of GTS that only
processes flat data files. It may also have the impact of reducing the
tabulation capabilities of the GTS system that is being distributed.

This distribution problem becomes more difficult because GTS is being
designed as a major part of an integrated statistical system.

If the total statistical system is to become a flexible and integrated one,
then it must use to full advantage the hardware/software facilities in
the environment in which it is to function.



Need for Decision :

GTS-1 is technically very mobile; however, distributing GTS-3 and beyond
will become difficult. At some point distribution may not be possible.

Can the users of our data somehow use the system we are building?

Can we produce an environment that will permit access to this data
and software at a reasonable cost?

Now is the time to determine if the total statistical system, or parts of
the system, should be allowed access by non-bureau personnel. If it is
to be accessed by non-bureau personnel, we must define how it should be



( 




[Figure: TABLE GENERATOR OVERVIEW. Blocks shown: USER (demand/batch user entry); USER PROCESSORS (Dictionary Processor, User Language Processor); control data for Execute Processor; EXECUTE PROCESSORS (input data file, output tally file); DISPLAY PROCESSORS (including Graphic Processor).]



VIEWGRAPH #2

[Title partly illegible] Unpub Tables

            CPU        I/O Words      Cost
Total       721,000    300,000,000    $5300.00

A. Cost per table:  $5300 ÷ 1535      = $3.45
B. Cost per cell:   $5300 ÷ 1,728,000 = $0.003
C. Cost per farm:   $5300 ÷ 1,981,578 = $0.0026
D. Total file contained 1,981,578 farms.
E. Total Wall Clock Time = 27 hours 12 min 38 seconds


1974 Volume II Tables
Group #3

            CPU        I/O Words      Cost
Total       473,032    292,949,768    $2264.00

A. Cost per table:  $2264 ÷ 2930      = $0.77
B. Cost per cell:   $2264 ÷ 244,000   = $0.0092
C. Cost per farm:   $2264 ÷ 1,981,578 = $0.0011
D. Total file contained 1,981,578 farms.
E. Total Wall Clock Time = [illegible] hours 18 minutes 58 seconds



Reference Materials Used by Speakers
at the Data Presentation Group

• \ 



IAMS. 



IBM, Research Division, San Jose, California, in discussing
Interacting with Data via Computer Graphics, used the
following three previously-published papers:

1. P. R. Mantey, J. L. Bennett and E. D. Carlson.
Information for Problem Solving: the development
of an Interactive Geographic Information System.
Proc. IEEE International Conference on Communications,
June 11-13, 1973, Vol. II, Seattle,
Washington. Available from IEEE.

2. D. Weller, R. Williams. Graphic and Relational
Database support for problem solving. Proc.
SIGGRAPH '76. Available from ACM, SIGGRAPH, in
Computer Graphics, Vol. 10, No. 2, Summer '76,
pp. 183-189.

3. E. D. Carlson, G. M. Giddings and R. Williams.
Multiple colors and Image Mixing in Graphics
Terminals. Proc. IFIP Congress '77, Toronto,
Canada. Published by North Holland Pub. Co.,
pp. 179-182.



E. CORNISH. U.S. Bureau of the Census, for his discussion used part
of an unpublished feasibility study by the "GRAPHICS AND
PUBLICATIONS" Subcommittee of the "EDP REQUIREMENTS"
group of the U.S. Bureau of the Census. The study was
concluded in August of 1977.



MATERIALS PREPARED FOR SUB-GROUP DISCUSSIONS



Materials Prepared for the Data Presentation Group

by

Shirley Gilbert, Princeton/Rutgers
Census Data Project, Princeton University



The results of the survey of Summary Tape Processing Centers conducted
by the Bureau of the Census and reported in the July 1977 Data User News
clearly indicate a need for software support for processing 1980 census data
tapes. How this need should be met in terms of specific program abilities
to retrieve data and provide flexible report formats is important. Equally
important, it seems to me, is consideration of how the production and
distribution of this software will be implemented.

The Census Bureau's primary function in the area of user services should
be to provide clean, well-documented data as promptly as possible. Once the
data are delivered the function should be to inform data processors of
problems in use of the data as soon as these problems become known. To ask
the Bureau itself to write software compatible with the hardware of the great
variety of computers serving Summary Tape Processing Centers is unreasonable.
This conference can very usefully address the problems of how and by whom
software can be produced and evaluated outside the Bureau in such a way that
the Bureau can advise users of the availability of software for any particular
system.

As a first step, I would like to see the members of this conference
designate a committee composed of persons familiar with computer systems used
by potential data processing centers. This committee could explore:

(a) How best to develop software where none now exists. (The
most efficient procedure may not be the same for each of
the several computer systems.)

(b) How to evaluate programs so that the Bureau can make
recommendations to potential users.



THE GENERATIVE APPROACH TO SOFTWARE DEVELOPMENT*
Gary L. Hill

Director, Information Systems, CACI, Inc. - Federal
ABSTRACT

The National Institute of Child Health and Human Development
(NICHD/NIH) provided funding for the analysis of unique data processing
problems posed by large statistical data files. One mechanism that resulted
from this activity was the CENTS-AID II system, which reduces the cost of
accessing large data files by as much as 80%. The generative programming
techniques designed into the system are responsible for this significant cost
reduction. CENTS-AID II is currently being used in over 50 computer sites
around the world including the Belgian Archives, University of Heidelberg,
Prudential Insurance Company, Congressional Budget Office, Social Security
Administration, National Institute of Health, and the New York State
Workmen's Compensation Board. The system is operational on the IBM
360/370 under OS and DOS.

1. INTRODUCTION: The Problem

Most generalized statistical access systems used by today's academic
community were designed using interpretive programming techniques. That
is, they were designed to scan researchers' commands and build extensive
logic tables. Subsequently, as each record from the data file is processed,
the contents of the logic tables are scanned and interpreted to control the
execution of specific preprogrammed functions which will yield the outputs
requested. As the research community developed new statistical routines,
additional preprogrammed functions were integrated with minimal
modifications to the basic processing methodology of the logic tables. As a
result, the most popular generalized systems include a variety of analytic
capabilities and require more than 200,000 bytes of core storage to execute.
Even though logic tables are continuously scanned for each record on a file,
and large segments of core storage must be allocated for execution,
interpretive programming techniques offer an efficient mechanism for
analyzing a limited set of observations. The same interpretive techniques do
not, however, offer an efficient mechanism for analyzing large statistical
data files.



Large data producers such as the federal government provide a continuous
flow of computerised statistical data. Most of these files contain tens of
thousands, hundreds of thousands, or millions of records. Further, many of
these sequential files are organized in a hierarchical, or tree structure,
format. This type of file organization provides for the definition of one or
more record formats describing different units of analysis. For example, a
file may contain one record format to describe the characteristics of
households, another to describe persons, and a third to describe purchases.
Additional valuable data relationships are defined by arranging the records
in a predetermined order (tree structure): purchase records immediately
follow the person record responsible for the purchase, and person records
follow the household record in which they reside. Such a file provides
researchers the opportunity to analyze the characteristics of purchases, the
characteristics of people, and the characteristics of households. Further,
the file enables researchers to analyze the characteristics of purchases with
the characteristics of people, the characteristics of purchases with those of
households, and the characteristics of purchases with those of people and
those of households, et cetera, through all combinations and permutations of
purchases, people, and household characteristics.

*Material submitted for the Sub-group on Tabulation.

The analytic potential afforded by this type of file structure far exceeds the
capacity of the punched card concept of file organization where each file
has a single unit of analysis expressed in one record format. Unfortunately,
most statistical access systems utilizing interpretive programming
technology still require data to be organized as if they were in punched
cards. In order for researchers to access the larger, more sophisticated
files, data must first be reorganized to suit the unique specifications of the
software system being used. This process is not only costly, but often
destroys valuable data relationships defined by the original structure of the
file. Whereas the utilization of interpretive programming techniques has
tended to promote the general use of computers by the research community,
it has also tended to limit access to large files.
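
A small sketch may help show why the household/person/purchase ordering carries information that a flat, one-record-type file loses. The record type codes (H, P, X), field names, and data below are invented for the illustration and do not describe any actual Census or CENTS-AID file layout. One sequential pass keeps the current household and person as context, so each purchase can be cross-tabulated with the characteristics of its owners without first reorganizing the file.

```python
from collections import Counter

# Hypothetical sequential file: H = household, P = person, X = purchase.
# Record order alone says which person and household a purchase belongs to.
FILE = [
    ("H", {"region": "urban"}),
    ("P", {"age_group": "adult"}),
    ("X", {"item": "food"}),
    ("X", {"item": "fuel"}),
    ("P", {"age_group": "child"}),
    ("X", {"item": "food"}),
    ("H", {"region": "rural"}),
    ("P", {"age_group": "adult"}),
    ("X", {"item": "fuel"}),
]

def tabulate_purchases(file_records):
    """Cross-tabulate purchases by household region and person age group in one pass."""
    household = person = None
    table = Counter()
    for rec_type, fields in file_records:
        if rec_type == "H":
            household, person = fields, None  # a new household resets the person context
        elif rec_type == "P":
            person = fields
        elif rec_type == "X":
            table[(household["region"], person["age_group"], fields["item"])] += 1
    return table

table = tabulate_purchases(FILE)
for cell, count in sorted(table.items()):
    print(cell, count)
```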

The National Institute of Child Health and Human Development
(NICHD/NIH) became increasingly concerned that many valuable data
resources were being under-utilized by the research community.
Consequently, funding was provided for the analysis of the unique data
processing problems posed by large statistical files. One of the mechanisms
that resulted from this activity was the high-speed CENTS-AID II System,
hereinafter referred to as CENTS-AID.

2. CENTS-AID: The Generative Approach

CENTS-AID (Release 3.0) is specifically engineered to minimize the cost of
accessing large data files through the use of generative programming
technology. In benchmark comparisons with another widely used system
designed around interpretive programming techniques, CENTS-AID's
generative approach reduced computer costs by over 80%. Based upon user
prepared commands, CENTS-AID generates a tailored ANS-COBOL program
to process and analyze the data file. Subsequent system modules are used to
format and display cross-tabulations of up to eight dimensions, produce
subfile extracts complete with a self-documented computer-readable Data
Base Dictionary (DBD), generate and display correlation and covariance
matrices, and create an SPSS (Statistical Package for the Social Sciences)
Correlation Interface File upon request.







The CENTS-AID system comprises seven programmed modules, three
standard utility sorts, and the ANS-COBOL compiler and loader. The
system's generative approach can best be explained by examining the
schematic diagram displayed as Figure 1 on the following page. The diagram
does not depict each of the system's modules; instead it is intended to
portray the system's generative nature.



2.1 Fragment Generation: Describing an application in quasi-English
language commands, the user interfaces solely with the Fragment
Generation module of the system. This module performs format and syntax
checks on all commands, building a variety of internal tables and organizing
descriptive labels for subsequent report presentation. Once all commands
are validated, the module scans the internal tables ONCE, building
fragments of a COBOL program. These fragments are then combined with
information from the CENTS-AID Models File to create a complete
ANS-COBOL program specifically tailored to the application request.
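The generative idea can be illustrated with a minimal sketch. It is written in Python rather than ANS-COBOL, and the function and variable names are hypothetical; it is not the CENTS-AID generator. What it shows is the principle: the user's request is turned once into source text with the matrix dimensions and offsets written in as constants, and that text is then compiled and run, so nothing about the request has to be re-interpreted for each record.

```python
def generate_program(row_var, col_var, row_min, row_max, col_min, col_max):
    """Build the source text of a tabulation routine tailored to one request."""
    n_rows = row_max - row_min + 1
    n_cols = col_max - col_min + 1
    # Each string is a "fragment"; the dimensions and minimum values are
    # written into the program text as literal constants.
    fragments = [
        "def tabulate(records):",
        f"    matrix = [[0] * {n_cols} for _ in range({n_rows})]",
        "    for rec in records:",
        f"        matrix[rec[{row_var!r}] - {row_min}]"
        f"[rec[{col_var!r}] - {col_min}] += 1",
        "    return matrix",
    ]
    return "\n".join(fragments) + "\n"


# Generate, "compile", and load the tailored program.
source = generate_program("sex", "marst", 0, 1, 3, 7)
namespace = {}
exec(source, namespace)
tabulate = namespace["tabulate"]

# Hypothetical input records.
records = [
    {"sex": 1, "marst": 5},
    {"sex": 0, "marst": 3},
    {"sex": 1, "marst": 5},
]
table = tabulate(records)
```

Printing `source` shows the tailored routine; an interpretive system would instead look up the variable names and ranges from its command tables every time it processed a record.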



When an application includes a request to generate a subfile extract, the
Fragment Generation module will automatically create and display a
computer-readable Data Base Dictionary (DBD) containing all detailed
technical characteristics of the new data file, as well as descriptive labels
for all variables and values of variables. The computer-readable DBD is
separate from the new subfile extract itself and can be placed on any direct
access storage device or, alternatively, as a separate file on a magnetic tape.
The Application module of CENTS-AID, to be described later, will actually
generate the subfile extract according to the technical characteristics
contained on the DBD. Subsequently, should the user wish to also analyze
the subfile extract through CENTS-AID, all computer-oriented technical
information and descriptive labels are automatically included through
reference to the subfile's Data Base Dictionary. Alternatively, users can
document master data files through the facilities of the Lexicographer
component, whose sole function is to generate computer-readable Data Base
Dictionaries. This one-time documentation activity reduces the amount of
technical knowledge required of statistical data users, and minimizes the
amount of coding required to describe applications.

For user applications that require the generation of cross-tabulations, the
Fragment Generation module is responsible for creating COBOL fragments
that dimension all tabulation matrices requested. The facility of
dimensioning tailor-made matrices into the generated ANS-COBOL program
contributes to the overall processing efficiency of the CENTS-AID system.
There is virtually no limit to the number of tabulations that can be
requested in a single application. However, no single table may exceed 17
columns, or 999 rows, or 8008 matrix cells. Matrix cells can be incremented
by a simple frequency count (1) or by the value of an observation variable
such as income, expenditures, age, or number of live births. In order for the
Fragment Generation module to dimension each table, the user must supply
the minimum and maximum numeric values of each variable to be included
in the table, either through CENTS-AID commands or via the DBD. Simple
data transformation commands are available to manipulate variables
containing alphanumeric or noncontiguous coding structures. Since each
matrix shell is specifically tailored to accommodate the requirements of an
application, CENTS-AID only reserves the amount of core storage actually
needed to analyze the data file and perform the tabulations. In many
computer billing algorithms, core storage costs are significant, so that by
reducing core requirements, computer processing costs can be minimized
further.
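As a rough sketch of the dimensioning step, the size of a two-way matrix shell follows directly from the minimum and maximum code values the user supplies. The limits below are the ones printed in the paragraph above; the function name and error handling are assumptions made for illustration, not part of CENTS-AID.

```python
# Table limits as stated in the text above.
MAX_COLUMNS = 17
MAX_ROWS = 999
MAX_CELLS = 8008


def dimension_table(row_range, col_range):
    """Return (rows, cols) for a two-way table, given (min, max) code ranges."""
    rows = row_range[1] - row_range[0] + 1
    cols = col_range[1] - col_range[0] + 1
    if rows > MAX_ROWS:
        raise ValueError(f"{rows} rows exceeds the limit of {MAX_ROWS}")
    if cols > MAX_COLUMNS:
        raise ValueError(f"{cols} columns exceeds the limit of {MAX_COLUMNS}")
    if rows * cols > MAX_CELLS:
        raise ValueError(f"{rows * cols} cells exceeds the limit of {MAX_CELLS}")
    return rows, cols


# Sex coded 0-1 by Marital Status coded 3-7: a 2 x 5 shell, 10 cells in all.
SHELL = dimension_table((0, 1), (3, 7))
```

Only the cells inside the declared ranges are reserved, which is why the supplied minimum and maximum values matter for core storage.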

CENTS-AID can also be requested to perform correlation analysis, generate
variance/covariance matrices, and create a variety of other statistical
measures. In those instances, the Fragment Generation module is
responsible for creating COBOL fragments that define working storage areas
and logic routines for the ANS-COBOL program to compute intermediate
statistics for pairs of X and Y variables, which will subsequently be processed
by the Statistical Generation module. The working storage areas and logic
routines are specifically designed to eliminate statistical error caused by
accessing large data files. The intermediate statistics include the number of
observations, the number of missing values, the sums of the X and Y
variables, the sums of squares of X and Y, and the sum of XY. All
computations are performed in double precision floating point.
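A minimal sketch of how such intermediate statistics can be accumulated in one pass and later turned into a Pearson correlation is given below. It assumes the running totals include the sums of squares of X and Y, which the later list of summary statistics (sums of squares, sums of cross-products) suggests; the names and the use of `None` for a missing value are illustrative only.

```python
import math


def accumulate(pairs):
    """One pass over (x, y) pairs, keeping only running totals."""
    s = {"n": 0, "missing": 0,
         "sx": 0.0, "sy": 0.0, "sxx": 0.0, "syy": 0.0, "sxy": 0.0}
    for x, y in pairs:
        if x is None or y is None:   # missing value: count it and skip
            s["missing"] += 1
            continue
        s["n"] += 1
        s["sx"] += x
        s["sy"] += y
        s["sxx"] += x * x
        s["syy"] += y * y
        s["sxy"] += x * y
    return s


def pearson(s):
    """Pearson r computed from the running totals alone."""
    n = s["n"]
    cov = n * s["sxy"] - s["sx"] * s["sy"]
    var_x = n * s["sxx"] - s["sx"] ** 2
    var_y = n * s["syy"] - s["sy"] ** 2
    return cov / math.sqrt(var_x * var_y)


STATS = accumulate([(1, 2), (2, 4), (3, 6), (None, 5)])
R = pearson(STATS)
```

Because only seven numbers per variable pair are carried forward, the large data file is read once and the later statistical step works from a very small file. Python floats are double precision, matching the arithmetic the text describes.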

The COBOL fragments generated are then combined with instruction format
information from CENTS-AID's Models File (reference Figure 1) to create a
complete ANS-COBOL program. In a matter of seconds, CENTS-AID
generates a tailor-made ANS-COBOL program designed to the specific
requirements of the user.

2.2 Application: Under the control of Job Control Language (JCL), the
ANS-COBOL compiler and loader compiles and executes the Application
program created by the Fragment Generation module. The resulting
program is the only module within CENTS-AID that analyzes the statistical
data file. Since the Application module is tailor-made to the specific
requirements of the user, processing logic is optimized and core storage
requirements are minimized. Because of the generative characteristics of
CENTS-AID, most data files do not have to be reformatted in order to be
analyzed. The Application module will directly process simple and complex
sequential file structures whose records are fixed or variable length. Files
can have up to twenty-six different record formats and a hierarchical
structure of up to thirty levels; data can be recorded in binary,
packed-decimal, and EBCDIC/BCD formats.

In addition to the basic generative characteristics of CENTS-AID, the
processing methodology integrated into the Application module to update, or
increment, matrix cells for cross-tabulations is also a major factor
contributing to the efficiency of the system. Instead of continually scanning
matrix dimensions to determine the proper matrix cell to increment (a
technique employed by most systems), CENTS-AID uses the actual code
values of the data file to compute "pointers" into each matrix. Simplified,
the algorithm used to compute the "pointers" for a two-way table is as
follows:

     POINTER = (Code Value - Minimum Value) + 1



To illustrate the technique, suppose a user has requested the generation of
a simple two-way tabulation (Sex by Marital Status), where Sex contains
two code values (0 and 1), and Marital Status contains five code values (3,
4, 5, 6, and 7). A record containing a value of 1 for Sex and a value of 5 for
Marital Status immediately points to the matrix intersection of (2,3):






     ROW POINTER    = (1 - 0) + 1 = 2
     COLUMN POINTER = (5 - 3) + 1 = 3
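The pointer computation can be checked with a short Python sketch. The `pointer` function is a direct transcription of the formula above; the surrounding `tabulate` function and the sample records are illustrative assumptions. Since the pointers are 1-based and Python lists are 0-based, the sketch subtracts 1 when indexing.

```python
def pointer(code_value, minimum_value):
    """POINTER = (Code Value - Minimum Value) + 1, a 1-based matrix index."""
    return (code_value - minimum_value) + 1


def tabulate(records, sex_range=(0, 1), marital_range=(3, 7)):
    """Count (sex, marital status) records without scanning the dimensions."""
    n_rows = pointer(sex_range[1], sex_range[0])          # 2 rows
    n_cols = pointer(marital_range[1], marital_range[0])  # 5 columns
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for sex, marital in records:
        row = pointer(sex, sex_range[0])
        col = pointer(marital, marital_range[0])
        matrix[row - 1][col - 1] += 1   # convert 1-based pointers to 0-based
    return matrix


ROW = pointer(1, 0)       # Sex = 1
COLUMN = pointer(5, 3)    # Marital Status = 5
MATRIX = tabulate([(1, 5), (0, 3), (1, 5)])
```

Each record costs two subtractions and two additions regardless of how many categories the table has, whereas a scan of the dimension labels grows with the number of categories.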



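[Diagram: the Sex by Marital Status matrix, with the ROW POINTER and COLUMN
POINTER locating the cell at row 2, column 3; the columns correspond to the
Marital Status code values 3 through 7.]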



The processing logic of the Application module functions according to the
specific requirements of the user's application. If a subfile extract is
requested, records are formatted and written to an output file as the
statistical data file is being processed. After the data file has been
completely analyzed, the Application module then generates a Summary
Tally File containing data for all cross-tabulations requested, as well as an
Intermediate Statistical File. These smaller files are subsequently
processed by the Table Formatting and Statistical Generation modules.

2.3 Table Formatting: The Table Formatting module is invoked solely for
those applications requesting tabular output. The module combines the
descriptive labels organized by the Fragment Generation module with the
content of the Summary Tally File generated by the Application module.
The module also computes column and row totals, as well as any optional
descriptive statistics requested such as percent, mean, median, variance,
and chi-square. The table formatting capabilities of CENTS-AID are
extensive. Users can request simple frequency counts of selected
variables, as well as more sophisticated cross-tabulations of up to eight
dimensions. The TABLE command is used to identify the variables to be
used in each tabulation. Variables named to the left of the keyword BY
comprise row variables, whereas variables named to the right comprise
column variables. The following TABLE command defined the six-way
tabulation displayed as Figure 2 on the next page.



     TABLE PLACE AND RACE AND INCGRP BY EMPST AND AGEGRP AND SEX



Figure 2. Table T007: PLACE OF RESIDENCE AND RACE AND INCOME GROUP BY
EMPLOYED AND AGE GROUP AND SEX

[Six-way table; the cell counts are not legible in this copy. Row stub:
PLACE OF RESIDENCE (URBAN, RURAL) by RACE (WHITE, BLACK, OTHER) by INCOME
GROUP ($0 TO $4,999; $5,000 TO $9,999; $10,000 AND OVER), with SUB TOTAL
URBAN, SUB TOTAL RURAL, and TOTAL rows. Column heading: EMPLOYED (YES, NO)
by AGE GROUP (16 TO 35, OVER 35) by SEX (MALE, FEMALE), with a TOTAL
column.]



Descriptive labels were obtained from the computer-readable DBD. The
Fragment Generation module analyzed the minimum and maximum values
for all six variables referenced in the TABLE command. It then adjusted the
"pointer" algorithm to automatically provide for the "nesting" of row and
column variables, as well as align all row and column labels for subsequent
display.
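The text does not spell out how the pointer algorithm is adjusted for nesting. One common way to do it, shown here purely as an assumption, is to treat the nested variables as digits of a mixed-radix number: each variable's offset from its minimum is scaled by the number of categories of the variables nested beneath it. The code values used for PLACE, RACE, and INCGRP are hypothetical, and subtotal rows are ignored.

```python
def nested_pointer(codes, ranges):
    """Combine several code values into one 1-based pointer.

    codes  -- code values, outermost variable first
    ranges -- matching (minimum, maximum) pairs
    """
    index = 0
    for code, (low, high) in zip(codes, ranges):
        size = high - low + 1           # number of categories at this level
        index = index * size + (code - low)
    return index + 1


# Hypothetical coding: PLACE 1-2, RACE 1-3, INCGRP 1-3 gives 18 nested rows.
ROW_RANGES = [(1, 2), (1, 3), (1, 3)]

FIRST_ROW = nested_pointer((1, 1, 1), ROW_RANGES)
FOURTH_ROW = nested_pointer((1, 2, 1), ROW_RANGES)
LAST_ROW = nested_pointer((2, 3, 3), ROW_RANGES)
```

With a single variable this reduces to the two-way formula, (Code Value - Minimum Value) + 1. The same function applied to EMPST, AGEGRP, and SEX would give the column pointer.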

2.4 Statistical Generation: The Statistical Generation module is executed
for applications requesting special statistical analysis such as Pearson
Correlation. The module processes the Intermediate Statistical File
generated by the Application module and produces a variety of optional
reports including correlation analysis with list-wise or pair-wise deletion,
and summary reports containing such statistics as means, standard
deviations, sums of squares, sums of cross-products, the number of
observations, and the number of missing values. In addition, the module can
optionally generate an SPSS Correlation Interface File. This file is
acceptable to SPSS (version 6.0) as original input to its library of statistical
functions which manipulate correlation matrices.
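The difference between list-wise and pair-wise deletion can be seen by counting the observations each rule keeps. The sketch below uses the standard definitions of the two terms and a small made-up data set with `None` marking a missing value; it does not reproduce CENTS-AID's own logic.

```python
from itertools import combinations


def listwise_n(rows):
    """Observations kept when any missing value drops the whole record."""
    return sum(1 for row in rows if all(v is not None for v in row))


def pairwise_n(rows):
    """Observations kept per variable pair; only that pair must be present."""
    n_vars = len(rows[0])
    return {
        (i, j): sum(1 for row in rows
                    if row[i] is not None and row[j] is not None)
        for i, j in combinations(range(n_vars), 2)
    }


# Made-up records with three variables each.
ROWS = [
    (1, 2, 3),
    (4, None, 6),
    (7, 8, None),
    (None, 1, 2),
]

N_LISTWISE = listwise_n(ROWS)
N_PAIRWISE = pairwise_n(ROWS)
```

Pair-wise deletion uses more of the data for each coefficient, but the coefficients in one matrix are then based on different subsets of records.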

3. PROCESSING EFFICIENCY: A Comparison

CENTS-AID is engineered specifically to minimize computer processing
costs for accessing large statistical data files. The generative techniques
employed in CENTS-AID do not necessarily produce a cost effective
mechanism for processing small data files. A series of benchmark tests was
conducted to demonstrate the effect of processing increasingly larger
volumes of data on CENTS-AID's generative approach and another system's
interpretive approach. Although we feel that it is unrealistic to compare
generalized systems that are designed for different purposes, we chose the
Statistical Package for the Social Sciences (SPSS) for this comparison
because it is so widely used. The benchmarks were not intended to be a
comprehensive evaluation of the merits of the two systems. Whereas
CENTS-AID is specifically designed to access large data files, SPSS offers a
wide range of statistical analysis capabilities that far exceed the current
facilities of CENTS-AID. The benchmark tests were designed by an outside
consultant to meet the following specifications: 1) the test must request
statistics which both systems could generate; and 2) it must use SPSS as
efficiently as possible. The benchmark application used the FASTABS option
of SPSS (version 6.0). The 1970 Public Use Sample Files were processed.
The results of the test are presented in the following table.



BENCHMARK TEST (IBM 360; model designation not legible in this copy)

                                TEST 1                 TEST 2                  TEST 3
                           SPSS     CENTS-AID     SPSS     CENTS-AID      SPSS      CENTS-AID

Number of Input Records    27,591     27,591    277,723     277,723    2,719,249   2,719,249
Size of Universe            5,442      5,442     54,741      54,741      537,567     537,657
Number of Variables             9          9          9  (illegible)           9           9
CPU Time (Seconds)         119.59      31.29    1188.17  (illegible)    1188?.00     1113.16
Core Storage                  214         94        214  (illegible)         214          94
Dollar Cost                $45.99     $10.78    $175.74      $24.48    $1543.04     $111.03



The comparative statistics generated by the three benchmark tests show
that, as the volume of data increases, the computer cost of performing
tabulations with software systems using interpretive programming
techniques can become almost prohibitive. Subsequent to the execution of
the formal benchmarks, further analysis of the processing efficiencies of the
two systems was undertaken. For example, each system generated multiple
tables using various combinations of user commands. Throughout these tests
the variation in relative processing efficiencies remained consistent, with
CENTS-AID applications costing approximately 80% less than the SPSS runs.
During the testing process, an SPSS SYSTEMS FILE was created which
substantially reduced SPSS tabulation costs. However, the cost of creating
such a file can rapidly become expensive, and valuable data relationships
may be destroyed in the process.



***** 






CONSIDERATIONS IN THE DESIGN OF USER-ORIENTED
TABULATING SOFTWARE*

Rudolph C. Mendelssohn

Bureau of Labor Statistics
U.S. Department of Labor, Washington, D.C.



The design of user-oriented software must begin with the
identification of the users and the problems they wish to
solve. Then, at the highest technical levels, the
requirements are exclusively those of designing a language
that will allow the users to communicate their problems to
the computer. This is followed by the design of a
generalized computer system to provide the product specified
by the user.

Who are our users and what is their problem? Our mission
says that the users are those who want to do tabulations.
And, because the software is to be user oriented, I believe
we are to assume that the user must be someone who lacks
training in the computer sciences, does not care to learn
either how computers work or the step-by-step procedures
that get the computer to solve problems.



This may sound like a condemnation of users generally.
However, I intend it as an observation of our own failure to
see the computer as a tool to be given to users to operate
in their own professional environment. The users should not
be required to learn another discipline. Rather, they
should be able to deal with the computer in their own
technical language.

*Materials prepared for the Sub-group on Tabulation of Data.

The most flexible tool that we can offer users would be a
natural language. But there are ambiguities present in
natural languages. You and I can cope with these
ambiguities through combinations of subtle nuances,
assumptions, and prompting. Computers cannot tolerate so
much freedom. A user language to talk with computers must
be structured according to the demands of computers.



Knowing that computer rigidities will be a constraint, but
that the language should be as close to natural as possible,
we must ask ourselves what language users employ to
specify a table. Five years ago BLS undertook a study of
the language used by our economists, statisticians,
demographers, and other social scientists in describing and
specifying tabulations.



Determining these language characteristics was not a simple
matter because of the range of tables BLS users specify.
These tables fall into three broad classes: those published
in the Bureau's bulletins and reports, work tables used in
the production of the published data, and a third class more
difficult to observe. The BLS professional personnel are
deeply involved in research and rely heavily on the Bureau's
massive data files. The form of the tabulations from these
files is not predictable, because the analyst typically
engages in an interactive process; that is, the study of one
table leads to new questions which require different tables
which generate new questions, and so on until the analyst is
satisfied.



Our study revealed one dominant fact: There was no
agreement within the Bureau on how to describe tabulation
methods and table formats. Inconsistency prevailed. Among
the computer systems staff, economists, statisticians,
demographers, and other social scientists throughout the
Bureau, commonly accepted terms and ordinary ways of
expressing needs meant quite different things. Terms like
variable, data element, data item, and field often were
interchanged, depending on the context or the user's
background. Simple words like row, line, column, table,
summary, and cross tabulation had varied interpretations.



Nor did a look at other tabulation systems help. We
concluded, then, that it would be best to pursue an approach
that included a standardized language based on the
nomenclature most commonly used in BLS. This approach would
improve communication among BLS social scientists, computer
science professionals, and the computer itself.

From an analysis of the study findings, it became clear to
BLS that in building the standardized "language" the parts
of the table had to be identified and named, and an
unambiguous syntax had to be devised. This was done, and I
refer you to the BLS document, The Development and Use of
Table Producing Language, for a discussion of the structure
of tables and the standardized language that evolved.



Upon resolution of the language problem, the BLS staff
turned to the next step: the design of a generalized
computer system that would respond to user written
specifications for tabulations.



Briefly, the study had four goals:

1. The system should be able to produce most, if not all,
   of the Bureau's statistical tables.

2. It should be driven by a Table Producing Language
   that did not require the user to be competent in the
   computer science discipline.

3. It should be flexible and adaptable to changing needs
   for new tables and formats.

4. It should lead the way to composition of tables for
   publication.



The first step in system construction was to see what work
others had done, particularly other national statistical
agencies. A United Nations questionnaire, sent to national
statistical agencies in Europe, Australia, and North America
in 1972, disclosed nearly 50 systems that produced tables.
So much activity is certainly a demonstration that most
statistical offices regard some degree of generalization
desirable and possible. But two questions are raised:

1. Why so many different systems?

2. Why not use one of these in BLS rather than develop
   a new one?

Differences in computers and data file structures create
incompatibilities that limit the use of someone else's
programs, and much of the duplication of systems can be
explained this way. However, this does not explain why some
organizations have three or four different systems and why
BLS found it useful to develop its own. The Bureau reviewed
and analyzed all systems that could be found to see if they
could meet its goals. Almost every system examined was
capable of doing something useful. But the fact remained
that no system met or even came close to meeting all the
Bureau's requirements, individually or collectively.



In building our own system we relied heavily on the
knowledge gained in the study of other systems.
Particularly significant in this regard was the pioneering
work done by the Australian Bureau of Statistics in the
early and mid-1960's in the construction of their Report
Generator. Another important contributor was our own BLS
Information System which to some extent paralleled the work
in Australia.




The work which combined the results of the twin studies of
the user language and generalized tabulation program
culminated in the completion of the first publicly available
system in 1974. It is called Table Producing Language (TPL)
and is now at work in over 155 installations worldwide.

Many users of TPL are in commercial enterprises throughout
the United States and Canada. These include banks,
insurance companies, computer time-sharing services, heavy
industrial manufacturers, pharmaceutical houses, and
research and planning organizations. But State and
municipal agencies across the country, and more than a dozen
Federal agencies (including both houses of Congress) are
also users. Among educational institutions are over a dozen
major universities.



The count of TPL installations abroad shows fifteen national
statistical agencies, located mostly in Europe, but ranging
geographically from North Africa and the Mideast to
Australia and Thailand. United Nations installations in New
York and Geneva use the system and also distribute it to
member countries.



The Table Producing Language was judged to be the best in
competition with eleven other leading contenders by the
Committee on the Evaluation of Statistical Program Packages
of the American Statistical Association. The Committee
studied two principal characteristics: tabulating power and
simplicity of language. When integer scoring from one to
five for nine different attributes within these two
categories was used, all systems evaluated scored well above
the minimum figure. However, TPL scored the maximum
possible, 45, while the runner-up scored 36.



The language differs from the traditional computer
languages, such as COBOL, PL/1 and FORTRAN, in important
ways. The latter have general application in the sense that
they are used to solve a wide spectrum of problems in
business and science: problems ranging from accounting,
inventory, and production to weather forecasting and getting
men to the moon. But in doing so, the user must give the
computer step-by-step instructions on how to solve the
problem being presented to it. That requires the user to
know how computers work.

The Table Producing Language belongs to an emerging class of
computer languages called very high level, problem
oriented: very high level because they are disengaged from
the computer, and problem oriented because they deal with
narrow needs. TPL has limited application: it can only
prepare tables, nothing else. On the other hand, this
specific focus has allowed the embodiment of several
advantages over the better known traditional but less
specifically directed languages.



The TPL system already knows what a table is and how to
generate one. It only needs to be told the particulars
about the one wanted. Thus, when describing the desired
table with the Table Producing Language, the user need not
go through the tedious and time-consuming effort of telling
the computer, step by step, how to make the calculations and
lay out the table framework. Moreover, it allows Bureau
social scientists who are not computer experts to use
everyday common BLS language and nomenclature to describe
the tables. In short, TPL has reduced a burden, speeded
work, and increased the BLS capacity to respond.
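The contrast between telling the computer what table is wanted and telling it how to build one can be sketched as follows. This is written in Python, not in TPL, and the specification format, variable names, and records are all hypothetical. The user's part is the short `SPEC`; the general-purpose `make_table` routine stands in for a system that already knows how to generate a table.

```python
from collections import Counter


def make_table(records, spec):
    """Generic tabulator: counts records by the row and column variables."""
    counts = Counter()
    for rec in records:
        row_key = tuple(rec[name] for name in spec["rows"])
        col_key = tuple(rec[name] for name in spec["columns"])
        counts[(row_key, col_key)] += 1
    return counts


# What the user supplies: only the particulars of the table wanted.
SPEC = {"rows": ["place", "race"], "columns": ["sex"]}

# Hypothetical data records.
RECORDS = [
    {"place": "URBAN", "race": "WHITE", "sex": "MALE"},
    {"place": "URBAN", "race": "WHITE", "sex": "MALE"},
    {"place": "RURAL", "race": "BLACK", "sex": "FEMALE"},
]

TABLE = make_table(RECORDS, SPEC)
```

A new table needs only a new `SPEC`; in a general-purpose language the user would instead rewrite the loops, counters, and layout logic for each table.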




I have mentioned some good things about TPL. Now, what is wrong
with it? First and foremost, it will only run on medium to
large-scale IBM machines, or their equivalent, such as Amdahl
and perhaps Itel. We have had many requests for a version
that would run on other machines. Unfortunately, from the
viewpoint of these requestors, our mission is to serve BLS
requirements. An effort to make TPL run on machines of
brand names other than our IBM equipment would have been too
costly.

Secondly, the system is monolithic: the user gets all or
none of it. It includes special features that are closely
allied with our needs. For example, there is emphasis on
formatting tables for display in BLS publications through
the use of electronic photo composers. This is useful to an
agency that publishes most of its extensive production in
table form but likely to be of little use in academic
research. If the user has a small machine and small or
limited needs, he cannot just take the part that will help
him.

Efficiency could be improved. Users can be unaware that a
chosen approach is much less efficient than another that
would give exactly the same result. For example, we find
users breaking problems into smaller pieces than they
should, resulting in extra costs at run time. We feel the
system should protect them from these inefficiencies. An
important goal of our project was to bring the cost of
processing very large files down to palatable levels, and we
have reduced these costs by impressive amounts
compared to our alternatives. But the costs are still more
than we like.

In summary, although TPL is the result of a pioneering
effort and embodies important advances, a new effort should
learn from its deficiencies as well. These include lack of
portability across machines of different manufacturers,
excessive size owing to the inclusion of special-purpose
facilities, and lack of adequate protection against
excessive and unnecessary running costs.



APPENDIX D 



Status Report on Selected Census Bureau Activities 



The conferees focused attention on a number of important ongoing
and planned Census Bureau activities that were not covered in the
prepared papers. Since these activities were not only discussed at length
but also became the subject of several conference recommendations, a
brief status report on their nature and prospects is provided in this
appendix. Topics described below include market research; data delivery;
training, consultation, and other user services; computer software;
machine-readable data directories; computer tape files; and microform.

Market Research 

The identification of users' needs is always an early and
high-priority activity of Bureau program planners. Many different approaches
are used to determine interests in data content, tabulations, forms of
data delivery, and data access and use assistance such as training and
reference materials. For example, the 1980 census planners held "public
hearings" in 74 cities and at several national conferences, and met with
representatives of State governments to solicit recommendations. The
planners also participate in the Federal Council on the 1980 Census and
maintain a mailing list of more than 7,000 interested persons, to keep
them informed through the 1980 Census Update, a newsletter that carries
articles asking for users' suggestions on particular topics. Two
planning conferences were held late in 1977 for representatives of summary
tape processing centers and other tape users, resulting in more than 200
recommendations for 1980 census products and services. To obtain input
to their programs, the Economic Census Staff sought suggestions
concerning data content and tabulations for the 1977 Economic Censuses from
hundreds of trade associations and institutes. The Bureau also maintains
nine standing advisory committees;

/ 



com 

\ 

Data Delivery



The Census Bureau is quite sensitive to the fact that effective and
widespread use of its products is dependent upon an effective data
delivery system which provides convenient access by novices and advanced
users alike. To supplement established data access points such as the
more than 1,000 Federal depository libraries and its own sales facility,
the Bureau has expanded its own census depository library system, is
seeking to improve the Summary Tape Processing Center Program, and has
initiated a State Data Center Program. The latter program is a
cooperative effort between participating States and the Bureau to improve the
ability of State governments to operate data dissemination and user
services facilities for the benefit of users in State and local agencies,
universities, and the private sector.



Training, Orientation, and Other Services

The user services function of the Bureau is made up of such activities
as product promotion, inquiry handling and user consultation,
orientation and training, and provision of reference materials and other
user aids. The user training schedule for 1978 includes 28 course
offerings, ranging from the popular 4-day intergovernmental and librarians
seminars on accessing Federal statistics to courses on using machine-
readable data files and using census data to meet Federal requirements,
and on making population estimates and projections. A comprehensive
inventory of guides, directories, indexes, and other user aids is
available. The monthly Data User News keeps users informed about new
products (also listed in the Bureau of the Census Catalog), training
opportunities, and other relevant topics. Further, training and inquiry
services have been enhanced through the placement of user services
specialists in the Bureau's 12 regional offices.

Computer Software 

The 1970 census might be remembered most for the large assortment
of machine-readable products it produced. A combined total of more than
3,000 summary, microdata, and geographic reference tapes were released
from that census. In recognition of the need by users for computer
software to process these tape files, the Bureau developed and distributed
data tabulation and display programs (DAUList 1-5 and COCENTS), geocoding
software (ADMATCH and UNIMATCH), and computer mapping programs (C-MAP
and GRIDS).

A study is currently underway to identify gaps in the software
generally available from all sources that users need to process Census
Bureau data and geographic reference files. The study results will be
used to determine whether the Bureau should develop additional software
for distribution to users.

A short-lived effort was made after the 1970 census to establish a
software clearinghouse to provide users with a comprehensive listing
of available programs for processing census files. The effort may be
revived in association with the 1980 census.

Machine-Readable Data Directories

In order to be further responsive to the needs of users of computer-
oriented products, machine-readable data directories have been prepared
for recent products such as the 1974 Census of Agriculture tapes, Annual
Housing Files, and Annual Demographic Files. Similar directories will
be developed for all future public-use files.

Computer Tape Files and Microform 

While the needs of users of printed reports will continue to receive
a high priority, there is a definite, and deliberate, trend towards the
release of more and more data on computer tape. This is in recognition
of the desire for the "publication" of greater quantities of detailed
data as well as the efficiencies of releasing data in this form. In
addition to computer tape, microform (fiche and film) will be more
extensively utilized as a data delivery medium. The combination of
computer tape and microform makes it possible for the Bureau to be
responsive to the growing demand for additional data without contributing
to the "paper explosion."

In summary, the Census Bureau recognizes that it has a responsibility
beyond just collecting, tabulating, and publishing data. Its staff is
aware of the large and diverse data user community and seeks in a
multitude of ways, such as those outlined above, to be responsive to these
users.






